Abstract

Generality, representation, and control have been the central issues in machine
recognition. Model-based recognition is the search for consistent matches between
model and image features. We present a comparative framework for the evaluation
of different approaches, particularly those of ACRONYM, RAF, and Ikeuchi et al.
The strengths and weaknesses of these approaches are discussed and compared, and
remedies are suggested. Various tradeoffs made in the implementations are analyzed
with respect to the systems’ intended task domains. The requirements for
a versatile recognition system are motivated. Several directions for future research
are pointed out.

Keywords: computer vision, model-based recognition, representation, object modeling,
search control, consistent labeling.


1  Introduction 


The task of making a computer “see” things has fascinated many researchers for several
decades. The task is challenging enough for its own sake. The results can shed light on
how human vision works, and can be used in many practical applications. Although research
in neurophysiology has revealed much biological evidence that might suggest how
the human vision system works, there has been little progress towards a general understanding
of human vision. David Marr [72], on the other hand, approached the problem
from a computational point of view and defined vision as achieving a particular visual
task. The task can be analyzed and implemented computationally. This task-oriented,
computational approach has stimulated much creative research, from which many recent
approaches have stemmed.

1.1  Machine  recognition

A  recognition  system  is  usually  part  of  a  larger  system  that  senses  the  environment, 
interprets  the  sensed  data,  understands  the  objects  in  the  scene,  and  acts  upon  the  world. 
The goal of machine recognition is thus, given sensed data and a priori knowledge about
the world, to “see” what objects are in the scene and, if there are any, to determine their
precise positions and orientations. Besl and Jain [3] give a mathematical formulation of
machine  recognition,  defined  as  an  inverse  mapping  of  scene  to  image  projection. 

This paper reviews three different approaches [16, 33, 60] to object recognition in
computer vision. These approaches, along with other work, constitute an important class
of machine recognition methods. They are instances of a successful paradigm in machine
recognition, namely model-based recognition [3, 6, 19], but differ in their representation of
objects, handling of control, and intended task domains. In this paradigm an observed
object is recognized as being an instance of a model if their respective primitives are
pairwise consistently matched [15]. The paradigm has its origin in the work on scene
analysis [77, 41, 49, 20, 85]. The representation of objects and images, the selection of
data, and the control of the matching process have been the central issues in model-based
machine recognition and have been actively researched.

The  ACRONYM  system  [16]  is  a  general,  domain  independent  system.  It  recognizes 
objects  in  the  image  and  determines  the  positions  of  the  objects.  Geometric  constraints 
are  used  to  relate  parts  of  objects.  The  RAF  system  [33]  handles  overlapping  objects 
in  a  polyhedral  world.  It  searches  an  interpretation  tree  that  represents  all  the  matches 
to be considered. Local geometric constraints are used to drastically reduce the size of
the search. Ikeuchi et al.’s system [60] works in bin-picking task domains. It assumes that
only  instances  of  one  object  are  present  in  the  image  and  does  not  handle  occlusion.  On 
the  other  hand,  Ikeuchi  et  al.  present  a  complete  methodology  of  automatic  programming 
for  machine  recognition. 

In this paper we will motivate the central issues of machine recognition (GENERALITY,
REPRESENTATION, and CONTROL) and cast recognition as representation and
control of search, driven by its intended generality. We will then discuss how different
approaches can be compared in a framework defined by the three axes: generality,
representation, and control. We will analyze various tradeoffs made in the implementations
with respect to the systems’ task domains.

1.2  Putting  machine  recognition  into  a  larger  context 

There  has  been  extensive  research  in  the  classical  areas  of  image  processing,  early  vision, 
pattern  recognition,  and  scene  analysis. 

Image processing takes images and produces “enhanced” images that are useful as
inputs to vision systems. Early vision systems produce descriptions of the input image,
for example intensity changes like edges [45] and needle maps [70, 47], and group them
into regions based on some similarity measures [72]. Recognition systems use the outputs
of early vision systems to identify scene objects and determine their positions and
orientations [3]. Many techniques of early vision systems have been used in machine
recognition.

Machine recognition is closest to pattern recognition and scene analysis. Pattern recognition
[23] classifies an object from a vector of primitive features that are measurements
of some global characteristics of the object. These measurements include area, perimeter,
center of mass, moments of inertia (for orientation), and the ratio of maximum to minimum
inertia (for elongation). The feature vectors form a feature space. The hope is that the
feature vectors of different classes of objects lie far apart in the feature space, so
that simple techniques like the nearest neighbor method can be used to cluster features
into distinct classes. Classifying an unknown object means determining which cluster
it belongs to. The closeness of a feature point to a class is measured by some statistical
quantity. Pattern recognition therefore proceeds in two steps: feature extraction and
pattern classification. The features used are generally global in nature; many rich local
geometric relations are left out. Also, it is hard to find features that are invariant with
respect to variations, such as those in viewing direction, and to extract them reliably. Many
techniques developed for feature extraction and classification have been used in machine
recognition.
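
The classification step described above can be sketched in a few lines. This is only an illustration of the nearest neighbor idea, not a method taken from any of the systems reviewed here; the choice of features (area, perimeter, elongation), the sample values, the class names, and the function names are all assumptions made for the example.

```python
import math

# Hypothetical training set: (area, perimeter, elongation) -> class label.
# All numbers are made up for illustration only.
TRAINING = [
    ((100.0, 40.0, 1.0), "washer"),
    ((110.0, 42.0, 1.1), "washer"),
    ((60.0, 50.0, 6.0), "bolt"),
    ((65.0, 52.0, 5.5), "bolt"),
]

def distance(a, b):
    """Euclidean distance between two feature vectors of equal length."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def classify(features, training=TRAINING):
    """Return the label of the training vector nearest to `features`."""
    _, label = min(training, key=lambda item: distance(features, item[0]))
    return label
```

Because every feature here is a global measurement of the object, hiding part of the object changes the whole vector at once, which is one way to see why such features cope poorly with occlusion.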

Scene analysis [23, 87, 86] interprets a description of some scene and assigns known
labels to scene objects. It consistently labels objects in the scene, subject to constraints
derived from the objects and images, to give a meaningful interpretation of the scene. The
research on line labeling of polyhedral scenes started with Roberts’s seminal work [77], and
was followed by Guzman [41], Huffman [49], Clowes [20], and Waltz [85]. We will discuss
the work on line drawing interpretation in some detail in Section 3.4.1. Although this work does
not address the problem of producing reliable picture descriptions (the line drawings), its
techniques of feature representation and control of consistent labeling have been
extensively used in many approaches to machine recognition today.


2  A  Comparative  Framework  for  Machine  Recognition

2.1  Model-based  recognition 

Model-based recognition assumes the existence of a library of models. Models can be
specified in terms of geometric properties of the objects, or with other properties if available.
Models and images can be described by a set of features and constraints that relate
the features. The features and constraints form a network. Recognition is the search
for consistent matches (which constitute the identification of the image object), subject
to constraints, in the space of all possible matches between model features and image
features, and the determination of a transformation from the model to the image object.

Three  basic  questions  have  often  been  asked  in  machine  recognition: 

1.  What  constitute  useful  features  and  constraints? 

2.  How  can  features  and  constraints  be  reliably  extracted? 

3.  How  can  features  and  constraints  be  used  for  recognition? 

In this paper, we will focus on high-level aspects of these questions. We concentrate
mainly on questions one and three. Although much of the success of machine recognition
depends on the availability of features and constraints provided by early vision
modules, in this paper we emphasize the form and the use of features and constraints.

We identify the three central issues of machine recognition: GENERALITY, REPRESENTATION,
and CONTROL, with respect to which a comparative framework is defined. We
compare different approaches to machine recognition in this framework, see
how each of their intended generalities defines a subspace, and discuss how tradeoffs are
made within the subspace and across several subspaces. Figure 1 shows this comparative
framework. Each subspace is identified with a plane.

We  will  evaluate  a  recognition  system  with  its  two  components,  the  representation  and 
the  control  of  search,  and  discuss  how  the  design  of  the  system  is  driven  by  its  intended 
generality.  The  following  metrics  are  used  for  the  evaluation  of  each  approach: 

•  performance  on  its  intended  task, 

•  sensitivity  to  noise  and  occlusion,  and 

•  its  task-domain-independent  mechanisms  that  can  be  generalized. 

If  we  view  the  task  of  recognition  as  being  composed  of  two  parts:  the  identification 
of  an  image  object  and  the  localization  (position  and  orientation)  of  the  object,  the  three 
approaches that we will discuss in detail use more or less the same method: hypothesize a
match  between  the  image  object  and  a  model  and  verify  that  the  match  actually  produces 
a  legal  interpretation  of  the  image.  Other  methods  like  the  Alignment  Method  [83]  proceed 
by  hypothesizing  the  position  and  orientation  first,  and  use  the  hypothesized  model  to 
constrain  the  search  in  the  object  identification.  The  theme  of  all  these  approaches  is  the 
same:  hypothesize  and  verify.  They  differ  in  how  much  and  how  strong  the  hypotheses 

Figure  1:  A  comparative  framework  defined  by  generality,  representation,  and  control.

2.2  The  comparative  framework:  generality,  representation, 
and  control 

2.2.1  Generality 

There  have  been  two  contrasting  trends  in  approaches  to  machine  recognition: 

•  understanding  the  principles  of  human  vision  and  general  methodologies  of  vision 
systems; 

•  achieving  high  performance  for  industrial  applications. 

The first school of approaches attempts to answer the question of how the human vision
system works. It hopes to formulate a general theory and methodology for solving
machine recognition problems. The systems thus built embody mostly domain-independent
principles. The other group tries to find solutions to many specialized industrial
applications. It deals with restricted situations that demand high accuracy and speed,
such as robot arms picking an object from a pile on some flat surface.


1.  The  Dilemma:  Generality  or  Performance? 

It is not uncommon to see that systems aiming for generality either suffer from
complexity and system bugs, or work only on toy problems, and often do not deliver
on their promises. On the other hand, many recognition systems that work well
on specific industrial applications are overly restrictive and do not generalize to a
slightly different domain. The one system/one application situation requires experts
to modify the system significantly in order to accommodate changes in task
domains. Some people pursue a middle road that uses some psychological and
neurophysiological evidence to guide the design of a vision system [72]. The hope
is to build a system that has some generality and at the same time works well on a
class of real problems.

2.  The  Task  Domains: 

A task domain imposes strong restrictions on the images and object models, and
dictates what the system should achieve. The task also determines what tradeoffs
will be made, such as efficiency vs. generality and efficiency vs. accuracy. The
use of search cutoff (i.e., thresholding) in RAF and the precompilation of constraints into a
table in Ikeuchi et al. are both instances of these tradeoffs.

The  input  to  a  recognition  system  consists  of  one  or  more  images,  each  of  which  is 
either a 2D image, a 2½D map, or some other preprocessed image containing useful
feature  descriptions.  The  output  is  what  can  be  used  to  perform  a  task,  for  example, 
picking  the  identified  object  from  a  bin.  Clearly  the  output  is  task  dependent.  The 
divergence  of  criteria  for  the  outputs  of  recognition  systems  is  part  of  the  reason 
that  comparing  different  approaches  is  very  difficult. 

3.  The  Answer: 

A  system  is  usually  designed  for  solving  a  special  class  of  problems.  Ideally  a 
good  system,  although  developed  for  specific  problems,  should  embody  principles 
sufficiently  independent  of  the  particular  applications  so  that  they  are  applicable 
to  other  domains  as  well,  without  too  severe  degradation  in  performance.  The 
system  can  be  part  of  a  more  general  system  that  handles  more  complex  problems. 
The  answer  to  this  dilemma  is  to  have  some  separation  between  domain-dependent 
modules  and  domain-independent  modules  of  the  system.  The  underlying  theory  of 
the  system  and  the  tradeoffs  in  its  working  implementation  need  to  be  separated  as 
well. Some tradeoffs result from limitations of technology. Others arise from a lack
of either the understanding or the availability of high-quality low-level modules.
Ad hoc heuristics are introduced in order to make the theory work. However, systems
with good separation can be easily improved later.


2.2.2  Representation 

Marr  [70]  characterized  the  criteria  and  design  of  representations  of  geometric  shapes.  He 
used  three  criteria  for  judging  the  usefulness  of  a  representation: 

1.  accessibility  —  whether  a  description  can  be  computed  easily; 

2.  scope  and  uniqueness  —  the  class  of  shapes  it  represents  and  the  uniqueness  of  the 
description  for  each  shape; 

3.  stability  and  sensitivity  —  the  resolution  with  respect  to  small  variations  in  shape
parameters. 

Marr  argued  that  a  good  representation  should  be  coordinate  system  free  (that  is,  use  an 
object-centered,  as  opposed  to  viewer-centered,  coordinate  system),  consist  of  primitives 
with  various  sizes,  and  admit  a  modular  organization.  He  also  discussed  the  use  of  indexing 
techniques  to  show  how  to  hierarchically  organize  the  object  parts  and  the  use  of  additional 
visual  cues  to  overcome  the  difficulty  of  recovering  the  shape  due  to  partial  occlusions. 

In  the  domain  of  recognition,  representations  are  necessary  for  images,  models,  and 
sensors1.  The  central  issue  is  how  to  choose  and  represent  features  and  constraints, 
since  the  choice  and  the  description  of  features  and  constraints  affect  the  performance 
(accuracy, efficiency, etc.) of recognition algorithms. Most past research has concentrated
on representing models and images, while little attention has been paid to modeling
sensors, with the notable exception of Ikeuchi et al. [62]. Without an analysis of sensors it
is hard to discuss the features quantitatively.

To  summarize,  the  following  are  desirable  properties  of  a  representation: 

•  The  features  and  the  associated  constraints  capture  the  characteristics  of  object 
models,  images,  and  sensors.  Restrictions  and  assumptions  on  the  image,  model 
and  sensor  domains  are  explicitly  reflected.  There  exists  some  metric  of  feature 
quality  which  can  be  used  to  decide  which  features  to  choose  and  in  what  order 
they  are  used,  so  that  the  search  for  consistent  matches  is  most  efficiently  carried 
out.  The  representation  also  facilitates  the  acquisition  of  models. 

•  The representations of images and models admit tractable and reliable computation
of their features. The sensor representation lends itself to easy analysis of
errors. There exist robust algorithms for extracting features and precise models for
describing constraints.

•  When there are a variety of features and constraints, the representation has a modular
organization that admits some form of indexing into the model library.

•  There exist efficient mechanisms to use the features and constraints for the matching
process of recognition tasks. We will discuss this in detail in the following section
on control.


2.2.3  Control 

In machine recognition, control refers to the use of features and constraints of the representation
in the recognition process. A brute-force search for matches between the observed
object and a model will develop a huge search tree, since there is an enormous semantic
distance between the input and the output of a recognition system. The resulting matches
are also sensitive to small variations in object and image parameters.

1For example, camera models and light models.

One important control paradigm is consistent labeling2, which has its roots in the
work on interpreting picture line drawings [41, 49, 20, 85]. It originated from Huffman
and Clowes’s work on line drawing labeling [49, 20]. This paradigm works as follows:
given a catalogue of image labels and consistency rules, the algorithm develops a tree
by assigning a label to each scene feature and backtracking when the assignment is
inconsistent with what is already known. Waltz [85] uses a filtering process to eliminate
locally inconsistent label assignments by the consistency rules and then conducts a global
tree search. As a result of the filtering, the size of the tree search is reduced.
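
The two stages, local filtering followed by a global tree search with backtracking, can be sketched as below. This is a minimal illustration in the spirit of the paradigm and does not reproduce Waltz's program or the Huffman-Clowes junction catalogue; the data layout (`domains`, `neighbors`), the `compatible` callback, and the toy rule that neighboring features must carry the same label are assumptions made for the example.

```python
def filter_labels(domains, neighbors, compatible):
    """Filtering pass: repeatedly drop any label of a feature that has no
    compatible label left at one of its neighbors. Mutates `domains`."""
    changed = True
    while changed:
        changed = False
        for x in domains:
            for label in list(domains[x]):
                for y in neighbors[x]:
                    if not any(compatible(x, label, y, other)
                               for other in domains[y]):
                        domains[x].remove(label)
                        changed = True
                        break
    return domains

def backtrack(domains, neighbors, compatible, assignment=None):
    """Tree search: label one feature at a time and backtrack when a label
    is inconsistent with the labels already assigned to its neighbors."""
    assignment = {} if assignment is None else assignment
    if len(assignment) == len(domains):
        return dict(assignment)
    x = next(v for v in domains if v not in assignment)
    for label in domains[x]:
        if all(compatible(x, label, y, assignment[y])
               for y in neighbors[x] if y in assignment):
            assignment[x] = label
            result = backtrack(domains, neighbors, compatible, assignment)
            if result is not None:
                return result
            del assignment[x]
    return None

def same_label(x, label_x, y, label_y):
    """Toy consistency rule (an assumption): neighbors must agree."""
    return label_x == label_y
```

For a chain of three features a, b, c with candidate labels {"+"}, {"+", "-"} and {"+", "-", ">"}, the filtering pass alone removes every label except "+", so the subsequent tree search has nothing left to explore.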

Successful recognition can be viewed as a consistent labeling of the image object subject
to some (often geometric) constraints. One common control strategy is hypothesize-and-verify,
that is, hypothesize a match of a model to the observed object and verify that the
match gives a legal interpretation of the image. Various techniques are used to further
reduce the need for search. Attention-focus [7] and preliminary grouping [63, 64] are two
such strategies. How global knowledge of models is used also affects the performance.
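
A toy version of hypothesize-and-verify for 2D point features is sketched below. It is not the procedure of any system reviewed here: the use of complex numbers for points, the pose form z -> a*z + b, the tolerance, and the function names are all assumptions chosen to keep the example short. A pose is hypothesized from two point correspondences and verified by counting how many model points land on image points.

```python
def hypothesize(m1, m2, i1, i2):
    """Pose z -> a*z + b that maps model points m1, m2 onto image points i1, i2.
    Points are complex numbers; m1 and m2 are assumed to be distinct."""
    a = (i2 - i1) / (m2 - m1)
    b = i1 - a * m1
    return a, b

def verify(pose, model, image, tol=0.1):
    """Count the model points that land within `tol` of some image point."""
    a, b = pose
    return sum(1 for m in model
               if any(abs(a * m + b - p) <= tol for p in image))

def recognize(model, image, min_support, tol=0.1):
    """Try pose hypotheses from pairs of image points; return the first pose
    that is rigid (no scaling) and supported by enough model points."""
    m1, m2 = model[0], model[1]
    for i1 in image:
        for i2 in image:
            if i1 == i2:
                continue
            a, b = hypothesize(m1, m2, i1, i2)
            if abs(abs(a) - 1.0) > tol:   # reject hypotheses that rescale the model
                continue
            if verify((a, b), model, image, tol) >= min_support:
                return a, b
    return None

# Example: the model rotated by 90 degrees, shifted by 10+5j, plus one clutter point.
model = [0j, 4 + 0j, 3j, 1 + 1j]
image = [1j * m + (10 + 5j) for m in model] + [20 + 20j]
pose = recognize(model, image, min_support=4)
```

The cheap rigidity test discards most hypotheses before the more expensive verification step runs, which is one simple instance of using constraints to reduce the need for search.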

Approaches  based  on  consistent  labeling  have  several  problems  that  deserve  special 
attention.  A  complete  catalogue  of  legal  labels  might  contain  too  many  entries  (e.g. 
several  million)  that  are  too  expensive  to  search  through.  The  problem  demands  the  use  of 
some  form  of  indexing  into  the  catalogue  in  order  to  achieve  any  reasonable  performance. 
Hierarchical grouping of entities can reduce the search time. The question is how to
organize the catalogue so that it reflects the semantic structure of the world. Worse, it is
possible that a complete catalogue of legal labels is either technically difficult to obtain or
physically non-existent. For picture line drawings where a junction is formed by an arbitrary
number of edges, Huffman [51] showed that a decision procedure can instead be used to
eliminate impossible labelings. However, whether such a test can be found for
arbitrary objects in general is still an open question.


3  Three  Model-based  Recognition  Systems 

This  section  reviews  the  three  approaches  of  ACRONYM,  RAF,  and  Ikeuchi  et  al.  to 
machine  recognition.  We  use  the  requirements  for  representation  and  control  and  the 
principles of generality to evaluate each approach. The task domains and intended
levels of performance of the three systems differ, ranging from
general to specific. The implementation of each system is greatly influenced by the task
it is meant to accomplish. Many task-domain-specific tradeoffs are made along the way from the
general theory of each system to its working implementation.

2An  alternative  to  consistent  labeling  is  the  relaxation  method.  Simply  stated,  the  relaxation  method 
is  a  spring-loaded  template  matching.  Interested  readers  should  refer  to  [21]  for  more  details  on  the 
relaxation  method. 

There are commonalities among them. Although each system chooses a somewhat different
representation and description of features and a different set of constraints, and
uses somewhat different techniques to reduce the size of the search space, all three systems use
more or less the same control strategy, i.e., consistent labeling. These approaches search
for possible interpretations subject to their own sets of constraints. The two principal
components of the three systems are the representation of features and constraints and
the search under constraints.


3.1  ACRONYM 

3.1.1  Overview 

ACRONYM  was  developed  by  Brooks  et  al.  [16]  at  Stanford  in  the  late  70’s  and  early  80’s. 
It  was  intended  to  be  a  general,  domain-independent,  model-based  recognition  system. 
More  precisely,  it  is  a  working  implementation  of  a  general  recognition  theory  of  prediction 
from  models  and  interpretation  of  images.  It  uses  a  volumetric  representation  of  objects 
and  a  domain-independent  geometric  and  algebraic  constraint  manipulation  system.  The 
system  has  been  tested  on  several  aerial  images  of  airfields  and  has  successfully  recognized 
certain  wide-bodied  jets. 

ACRONYM  is  equipped  with  a  library  of  object  models.  The  geometric  object  models 
are  specified  in  terms  of  a  few  simple  primitives  and  their  relations.  Given  a  2D  image, 
ACRONYM  identifies  instances  of  object  model  classes,  determines  their  location  and 
orientation  in  world  coordinates,  makes  subclass  identification  if  possible,  and  determines 
location  and  orientation  of  camera  if  necessary. 

Brooks  [15]  views  recognition  as  a  task  of  finding  consistent  matches  between  elements 
of  an  image  and  those  of  a  model.  The  question  is  at  what  level  the  matches  take  place.  He 
argues  that,  in  contrast  to  the  traditional  image-driven  view  of  vision  as  inverse  optics  [72], 
recognition can proceed from the model, from coarse to fine levels. Figure 2 shows the levels
of  representations  of  the  model  and  the  image  and  the  correspondence  between  them  at 
each  level.  The  appropriate  descriptions  of  the  model  are  predicted  from  the  model  and 
are  matched  against  the  descriptions  of  the  image  at  the  same  level. 

The approach of ACRONYM to image understanding relies on four components: object
models, prediction from models, interpretation of image descriptions in terms of model
prediction, and descriptions of images. These four modules are associated with four data
structures:  object  graph  and  restriction  graph,  prediction  graph,  interpretation  graph, 
and  picture  graph.  Figure  3  shows  the  four  modules,  data  structures,  and  data  flow  paths 
of  ACRONYM.  ACRONYM  deals  with  only  the  first  three  components.  It  does  not  use 
the  rich  image  descriptions  available.  Instead  it  uses  a  simple  and  primitive  system  to 
produce  descriptions  of  images.  The  three  components  are  tied  together  in  ACRONYM 
by  two  threads:  geometric  models  and  relationships,  and  algebraic  constraints. 

Figure  2:  Levels  of  representation  (from  Brooks)


Figure  3:  The  ACRONYM  System  (from  Brooks) 

3.1.2  Evaluation


In  the  following  discussions,  the  ACRONYM  system  is  evaluated  in  terms  of  its  four 
modules  and  the  interactions  among  them. 

1.  Object  models 

ACRONYM uses a view-independent volumetric representation for the objects. The
volume  primitives  are  generalized  cones  [5].  The  geometric  representation  models 
the  objects  and  their  spatial  relations.  Generic  object  classes  and  specific  objects  are 
modeled  by  class  and  subclass  relations  through  algebraic  specifications.  Variations 
of  size,  structure  and  spatial  relations  within  object  classes  are  modeled.  Volumetric 
models  and  spatial  relations  are  represented  by  the  Object  Graph  (Figure  3).  The 
nodes  are  volume  elements  and  arcs  are  spatial  relations  and  subpart  relations.  A 
user  specifies  models  through  a  text-based  description  language.  A  graphics  module 
provides  the  user  with  feedback  during  the  process  of  modeling. 

A  generalized  cone  is  produced  by  sweeping  a  cross  section  along  a  spine  through 
space.  The  cross  section  is  kept  at  a  constant  angle  while  sweeping  along  the  spine 
and  is  deformed  as  specified  by  a  sweeping  rule.  For  example,  an  airplane  is  modeled 
by  a  fuselage  with  two  wings  attached  to  it,  each  of  which  is  a  generalized  cone. 
Many  man-made  objects  can  be  decomposed  into  volume  primitives  in  a  few  ways. 

However,  generalized  cones  with  arbitrary  deformation  functions  and  general  shapes 
of  cross  section  will  be  computationally  intractable  to  deal  with.  Certain  tradeoffs 
have  to  be  made  for  the  purpose  of  realistic  implementation.  In  ACRONYM,  the 
generalized  cones  are  therefore  further  restricted.  The  cross  section  is  bounded 
by  straight  lines  and  circular  arcs,  spines  are  piece-wise  linear  or  circular,  and  the 
sweeping  rule  is  linear  and  continuous.  As  a  result,  the  relatively  simple  geometric 
representation  facilitates  the  prediction  of  features. 
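
As a rough illustration of such a restricted generalized cone, the sketch below samples surface points for the simplest case: a straight spine, a circular cross section held perpendicular to the spine, and a linear sweeping rule for the radius. It is not ACRONYM's modeling code; the function name, its parameters, and the sampling scheme are assumptions made for the example.

```python
import math

def generalized_cone(length, r_start, r_end, n_spine=5, n_around=8):
    """Sample surface points of a cone swept along a straight spine (the z axis).

    The circular cross section stays perpendicular to the spine, and its
    radius changes linearly from r_start to r_end. Requires n_spine >= 2.
    """
    points = []
    for k in range(n_spine):
        t = k / (n_spine - 1)                    # position along the spine, 0..1
        radius = (1 - t) * r_start + t * r_end   # linear sweeping rule
        for j in range(n_around):
            angle = 2 * math.pi * j / n_around
            points.append((radius * math.cos(angle),
                           radius * math.sin(angle),
                           t * length))
    return points

# A cylinder (constant radius) and a tapering part are both special cases.
cylinder = generalized_cone(length=10.0, r_start=2.0, r_end=2.0)
taper = generalized_cone(length=10.0, r_start=2.0, r_end=0.0)
```

With only a handful of parameters per primitive, quantities such as the apparent width of a part can be derived in closed form, which is the sense in which a simple representation eases prediction.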

One  of  the  requirements  for  representation  is  that  the  representation  is  unique,  that 
is,  it  admits  a  unique  decomposition  of  an  object  into  parts.  If  there  is  more  than 
one decomposition for an object, then the image and the model may be decomposed
incompatibly and fail to match correctly. Although many man-made objects
admit  decomposition  into  subparts,  in  general  it  is  hard  to  characterize  real  world 
objects  with  a  few  such  generic  parts.  Worse,  many  objects  do  not  have  a  natural 
decomposition.  For  example,  it  is  awkward  to  think  about  a  bushy  object  in  terms 
of  generic  subparts.  Bushy  objects  are  more  effectively  dealt  with  using  other  types 
of  descriptions  such  as  edges. 

ACRONYM models a complex object by a set of generalized cones. The spatial
relations of the hierarchy of object parts are represented by a subpart hierarchy and
an affixment tree. The class/subclass relations capture the commonality of class
members and variations across class members. This is advantageous for cases where
only partial recognition can be obtained. For example, an unknown car can be
recognized as a compact car while its specific model type is left undetermined.
Additional information, such as the local shape of the trunk, the color,
etc., will help to disambiguate among different models.
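
The idea of partial recognition through class/subclass relations can be sketched with interval constraints on a single parameter. This is a simplified stand-in for ACRONYM's algebraic specifications, not its actual representation; the class names, the `length` parameter, and every number below are made up for the example.

```python
# Hypothetical hierarchy: each class constrains a parameter to an interval,
# and a subclass narrows the interval of its parent. All values are made up.
CLASSES = {
    "car":         {"parent": None,          "length": (3.0, 6.0)},
    "compact car": {"parent": "car",         "length": (3.8, 4.4)},
    "model X":     {"parent": "compact car", "length": (4.0, 4.1)},
    "model Y":     {"parent": "compact car", "length": (4.2, 4.3)},
}

def consistent(cls, measured):
    """True if the measured interval (lo, hi) overlaps the class interval."""
    lo, hi = CLASSES[cls]["length"]
    return measured[0] <= hi and lo <= measured[1]

def children(cls):
    """Direct subclasses of `cls`."""
    return [c for c, spec in CLASSES.items() if spec["parent"] == cls]

def most_specific(measured, cls="car"):
    """Descend the hierarchy for as long as exactly one subclass is consistent."""
    if not consistent(cls, measured):
        return None
    candidates = [c for c in children(cls) if consistent(c, measured)]
    if len(candidates) == 1:
        return most_specific(measured, candidates[0])
    return cls
```

A loose measurement such as (3.9, 4.35) stops at "compact car" because both hypothetical models remain possible, while a tighter one such as (4.02, 4.08) reaches "model X".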

2.  Prediction 

In ACRONYM, features are predicted by geometric reasoning techniques. The geometric
reasoning mechanism combined with the geometric and algebraic representations
makes ACRONYM powerful and flexible in predicting features.

Object features in ACRONYM are invariant with respect to both object class and
viewpoint. They include shapes (ribbons and ellipses) and 2D spatial relations of
shapes in the image, and are specified by various constraints. Constraints are managed
by a constraint manipulation system (CMS). The object models are analyzed
for variations, structure, and spatial relations in the object model classes, and their
expected features are predicted. The result is the Prediction Graph (Figure 3)
whose  nodes  specify  predictions  of  image  features,  and  whose  arcs  specify  relations 
of  image  features.  The  prediction  graph  tells  the  matcher  how  to  find  instances.  It 
provides  a  coarse  filter  for  hypothesizing  matches  of  object  and  image  features.  The 
prediction  graph  also  contains  instructions  on  how  to  use  the  measurement  of  image 
features  to  deduce  3D  information  about  original  models.  Note  that  the  prediction 
graph  is  a  symbolic  summary  of  expected  image  appearances  of  models.  It  does  not 
produce  image  appearances  of  instances  of  models. 

The flow of control in prediction is backward-chained. A breadth-first walk down the
subpart hierarchy of object models predicts shapes at each level and partially
interprets the image. Refined prediction is made at the next level. The
domain-independent constraint manipulation system (CMS) deduces and propagates constraints.
As Brooks noted, much of ACRONYM's expertise comes from the generality of the
CMS in handling a variety of constraints [16], though at the expense of speed.
Although it has successfully interpreted some aerial images, applying ACRONYM
to realistic applications would require a significant improvement in its
speed. One way to achieve this is to prespecify the constraints, such as the angle
between the wings and fuselage of a certain type of jet, and then compile the constraints
into tables. Runtime constraint manipulation would then be merely a table lookup.
Grimson and Lozano-Perez [33] explored this idea in their recognition system RAF.
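
The table-lookup idea can be sketched as follows. This is only an illustrative sketch and not code from ACRONYM or RAF: the part names, the angle range, and the functions compile_constraints and satisfies are hypothetical, chosen to show how a prespecified constraint reduces to a lookup and a range test at runtime.

```python
# Hypothetical sketch: constraints are prespecified offline and compiled
# into a table, so the runtime check is a dictionary lookup plus a range
# comparison. All names and numbers are illustrative assumptions.

def compile_constraints(specs):
    """Compile (part_a, part_b, lo, hi) angle ranges into a lookup table."""
    table = {}
    for part_a, part_b, lo, hi in specs:
        # Use an order-independent key so (wing, fuselage) == (fuselage, wing).
        table[frozenset((part_a, part_b))] = (lo, hi)
    return table

def satisfies(table, part_a, part_b, measured_angle):
    """Runtime check: look up the legal range and test the measurement."""
    bounds = table.get(frozenset((part_a, part_b)))
    if bounds is None:
        return True  # no constraint was recorded for this pair
    lo, hi = bounds
    return lo <= measured_angle <= hi

# Offline step: a made-up wing/fuselage angle range (degrees) for one jet class.
TABLE = compile_constraints([("wing", "fuselage", 50.0, 70.0)])
```

The design trades generality for speed: only the constraint pairs anticipated offline are available at runtime, whereas a general constraint manipulation system can derive new ones.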

3.  Interpretation 

The interpretation proceeds by merging local hypothesized matches subject to
consistently derived constraints about variations of size, structure, and spatial
relations. The candidate image features are provided by the image descriptive process and
are  matched  to  predicted  object  features.  If  the  match  is  consistent  with  what  is 
already  known  about  the  model,  it  puts  additional  constraints  on  the  parameters  of 
the  3D  model. 
The descriptive process is invoked repeatedly. At first the multiple invocations
search for different image features to obtain a coarse image interpretation. Later
invocations search small areas of the image for particular features in order to get
detailed object class identification and three-dimensional quantification of objects.
The  prediction,  description,  and  interpretation  proceed  concurrently  from  coarse 
object  subpart  and  class  interpretations  of  images  to  fine  distinctions  among  object 
subclasses  and  more  precise  3D  information  about  objects,  as  finer  and  finer  details 
of  objects  are  identified. 

The  mechanism  here  is  essentially  consistent  labeling.  It  hypothesizes  a  global 
match,  predicts  additional  evidence  from  coarse  to  fine,  matches  against  that  of 
images,  and  discards  inconsistent  matches.  The  constraints  on  the  parameters  of 
the  hypothesized  global  match  are  successively  tightened  along  the  way  by  additional 
supporting  evidence.  Although  at  the  high  level  it  has  some  flavor  of  the  parameter 
relaxation  method  [21],  it  is  closer  to  consistent  labeling.  With  the  hypothesized 
match,  the  search  is  constrained  by  the  knowledge  about  the  model.  Again,  this  is 
made  possible  by  the  class/subclass  hierarchy.  In  some  sense  ACRONYM  is  close  to 
the  alignment  method  [83]  which  hypothesizes  a  match  by  transforming  the  model 
to  the  observed  object  and  uses  the  model  to  constrain  the  search,  although  the 
constraints  of  ACRONYM  are  much  looser. 
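
The successive tightening of parameter constraints described above can be sketched with plain numeric intervals. This is a simplifying assumption: ACRONYM's CMS manipulates symbolic algebraic constraints, whereas the hypothetical functions tighten and interpret below only intersect closed intervals on a single parameter.

```python
# Hypothetical sketch: a model parameter (e.g. an object length) is known
# only as an interval. Each matched feature contributes another interval,
# and the hypothesis is discarded once the intersection becomes empty.

def tighten(bounds, evidence):
    """Intersect two closed intervals; return None if they are inconsistent."""
    lo = max(bounds[0], evidence[0])
    hi = min(bounds[1], evidence[1])
    return (lo, hi) if lo <= hi else None

def interpret(initial, evidence_list):
    """Apply evidence from coarse to fine, tightening or rejecting the match."""
    bounds = initial
    for evidence in evidence_list:
        bounds = tighten(bounds, evidence)
        if bounds is None:
            return None  # the inconsistent match is discarded
    return bounds
```

For example, starting from (10.0, 60.0) and applying the evidence intervals (20.0, 50.0) and (30.0, 45.0) leaves (30.0, 45.0), while evidence of (70.0, 80.0) rejects the hypothesis.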

4.  Image  description 

ACRONYM uses a low-level feature description module to link together primitive
image edges. The local image feature descriptions and matches with the models
are guided by the predictions, which provide goal direction to an edge-linking
algorithm [11]. The edge linker is directed to preliminarily group together those edge
segments that are likely to come from the same object in the image.

This  initial  grouping  reduces  the  search  in  the  interpretation  process.  Jacobs  [64] 
has  shown  that  similar  techniques  of  preprocessing  image  features  have  the  effect  of 
reducing search time and increasing the accuracy of matches. The use of the edge-linking
process partially explains why ACRONYM does so well even though the initial
image descriptions are very poor. The drawback of ACRONYM's edge-linking
preprocessing is that the edge linker relies on domain-specific heuristics for
aerial images of airfields to link edge descriptions of poor quality, and these
heuristics do not generalize to other applications.


In  summary,  ACRONYM  was  designed  to  investigate  the  use  of  models  that  are  in¬ 
dependent  of  particular  descriptive  processes  and  to  develop  general  mechanisms  for  ge¬ 
ometric  reasoning  and  symbolic  constraint  propagation.  Its  strength  comes  mainly  from 
(1)  the  simple  geometric  models  and  relations  and  the  geometric  reasoning  system  used  to 
predict  and  guide  interpretation;  (2)  the  algebraic  constraints  and  the  powerful  constraint 
manipulation  system  that  model  class  relations,  give  quantitative  aspect  to  predictions, 
prune  incorrect  interpretations,  and  provide  3D  information  about  hypothesized  objects; 
and (3) the initial edge-linking process that reduces the search and increases the accuracy
of the interpretation.

The philosophy of ACRONYM capitalizes only on knowledge of geometric properties
of objects and general mechanisms of constraint manipulation. It assumes very little
dependence on image descriptions. We have seen in the above analysis that, in order for
ACRONYM to achieve realistic performance on real images, some tradeoffs are made in the
implementation. This is somewhat of a departure from the general philosophy its designers
had in mind at the beginning. The approach of Grimson and Lozano-Perez, discussed in
the next section, addresses the issue of obtaining high performance on moderately complicated
images by trading off some generality of the system.

ACRONYM does not pay special attention to occlusions and shadows. To extend the
system to handle these situations, it should explore the rich knowledge of image
descriptions. Other types of visual cues, such as those from image formation, may also be used, in
addition to the geometric knowledge of objects, to enhance the performance of the system.
However, in general we still do not know how to integrate different visual cues into a single
frame that uses them at its discretion and invokes them at appropriate times.


3.2  RAF 

3.2.1  Overview 

Grimson  and  Lozano-Perez  [33]  of  MIT  reported  an  approach  to  model-based  recogni¬ 
tion  based  on  searching  in  the  form  of  an  interpretation  tree  subject  to  local  geometric 
constraints.  Their  implementation  of  the  method  is  the  RAF3  system. 

The  approach  of  Grimson  and  Lozano-Perez  is  clearly  an  instance  of  consistent  label¬ 
ing.  The  task  is  to  identify  and  locate  objects  in  cluttered  environments,  that  is,  to  find 
what  objects  are  in  the  scene  and  where  they  are,  given  a  library  of  known  object  models. 
Recognition  consists  of  two  steps:  generation  of  feasible  interpretations  (hypothesizing 
step)  and  model  test  (verification  step).  The  first  step  is  to  develop  an  interpretation  tree 
and  search  for  correct  matches  in  the  interpretation  tree.  It  identifies  what  objects  are  in 
the  scene  by  relating  the  observed  objects  with  the  known  models.  Local  geometric  con¬ 
straints  are  used  to  prune  the  tree  search.  The  second  step  checks  for  global  consistency. 
It  also  finds  the  positions  and  orientations  of  the  identified  objects  in  the  image. 

The method assumes that objects are approximated by a collection of polyhedral
models. The adequacy of the model is discussed in a later paper [34]. Curved objects are hard
to model in this way. The polyhedral model is also not stable: small variations in the
parameters of a curved object can result in a significantly different polyhedral approximation and
lead to erroneous matches. RAF works in the environment of robot sensing. The input
sensed data to RAF consist of positions and the associated normals of object surfaces. The
data can be sparse, which is typical of tactile sensing.

3 RAF stands for Recognition and Attitude Finder.

Figure 4: Interpretation Tree (from Grimson et al.)

In an earlier version of RAF [30], the sensed data are assumed to come from a single
object, either 2D or 3D, and no occlusion is allowed. The recognition is formulated as a
search of an interpretation tree. The interpretation tree is constructed in a depth-first
fashion by assigning each sensed data point to every face of a model. A tree branch
is developed by pairing a data point with a face. Every path to a leaf of the tree is a
possible interpretation of the sensed data with respect to the model. Each possible
interpretation is then tested against the model to see whether all the points lie on and within the faces
of the model. Let s be the number of data points and n be the number of faces of the
object: the interpretation tree is s levels deep and each node has a branching degree of n.
Figure 4 shows an interpretation tree for a 2D polygonal object model. RAF generates
an interpretation tree for each model.

A brute-force search in the interpretation stage is too expensive to be feasible, since
the number of possible interpretations that the model is tested against is the number of
tree leaves, or n^s. Three types of geometric constraints [30] are considered to reduce the
size of the search space: (1) the distance constraint for distances between faces; (2) the angle
constraint for angles between face normals; and (3) the direction constraint for angles
of vectors between sensed points. For example, the possible distance between a pair of
points on object faces has to be within the legal range specified by the geometric model
of the object and the error model of measurements.
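
A minimal sketch of a depth-first interpretation tree search pruned by the distance constraint alone is given below. It is not RAF's implementation: it assumes a 2D model whose faces are the edges of a simple polygon, it ignores the angle and direction constraints, and it stands in for the error model with a single tolerance eps. The function names and the unit-square example are hypothetical.

```python
import math

def point_segment_distance(p, a, b):
    """Distance from point p to the 2D segment a-b."""
    dx, dy = b[0] - a[0], b[1] - a[1]
    length_sq = dx * dx + dy * dy
    t = 0.0 if length_sq == 0 else ((p[0] - a[0]) * dx + (p[1] - a[1]) * dy) / length_sq
    t = max(0.0, min(1.0, t))  # clamp the projection onto the segment
    return math.dist(p, (a[0] + t * dx, a[1] + t * dy))

def distance_table(faces):
    """Legal [min, max] distance between points lying on each pair of faces.

    Assumes faces touch only at shared vertices, so the minimum involves at
    least one endpoint and the maximum joins two endpoints.
    """
    table = {}
    for i, (a, b) in enumerate(faces):
        for j, (c, d) in enumerate(faces):
            d_max = max(math.dist(p, q) for p in (a, b) for q in (c, d))
            d_min = min(point_segment_distance(a, c, d), point_segment_distance(b, c, d),
                        point_segment_distance(c, a, b), point_segment_distance(d, a, b))
            table[i, j] = (d_min, d_max)
    return table

def interpretations(points, faces, eps=1e-9):
    """Depth-first search; a branch is cut as soon as a pairwise distance is illegal."""
    table = distance_table(faces)
    found = []

    def extend(partial):
        k = len(partial)
        if k == len(points):
            found.append(tuple(partial))  # reached a leaf: one interpretation
            return
        for face in range(len(faces)):
            consistent = all(
                table[partial[m], face][0] - eps
                <= math.dist(points[m], points[k])
                <= table[partial[m], face][1] + eps
                for m in range(k))
            if consistent:
                extend(partial + [face])

    extend([])
    return found

# Unit square with faces listed as bottom, right, top, left edges.
SQUARE = [((0, 0), (1, 0)), ((1, 0), (1, 1)), ((1, 1), (0, 1)), ((0, 1), (0, 0))]
POINTS = [(0.5, 0.0), (1.0, 0.5), (0.5, 1.0)]
```

For these three points on the unit square, the distance constraint alone keeps 36 of the 64 leaves, including the true assignment (0, 1, 2). The remaining ambiguity illustrates why further constraints and a final model test are still needed.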

RAF was later generalized to handle occluded objects [33]. To deal with the situation
where input data may come from multiple objects, a null branch is created below each
interpretation tree node. Assigning a data point to this branch is equivalent to discarding
the point as inconsistent with the model (note that one interpretation tree is constructed
per model). In addition to the three powerful geometric constraints, further mechanisms
are used to limit the search:

•  preprocessing  to  find  extended  features  such  as  edges; 
• heuristically guided search and premature search termination;

•  Hough  clustering4  to  preselect  likely  subspaces. 
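
The cost of the null branch can be made concrete with a small count. The sketch below assumes an unpruned tree in which every node gains exactly one extra (null) child, as described above; the function name leaf_count is hypothetical.

```python
def leaf_count(num_faces, num_points, null_branch=False):
    """Number of leaves in an unpruned interpretation tree."""
    # With a null branch, each node has one extra child besides the faces.
    branching = num_faces + (1 if null_branch else 0)
    return branching ** num_points
```

For a model with 6 faces and 10 data points, the count rises from 60466176 leaves to 282475249 leaves once the null branch is added, which is why the extra mechanisms listed above are needed.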




3.2.2  Evaluation 

1.  Mechanisms  to  improve  the  performance: 

Simulations  of  RAF  on  several  2D  and  3D  examples  have  shown  the  power  of  local 
geometric  constraints  in  reducing  the  effort  of  the  search.  RAF  has  since  been 
extended  to  handle  scaled  objects  [35],  curved  objects  [34],  as  well  as  objects  that 
contain  movable  subparts  [35].  Preliminary  results  show  that  the  search  increases 
significantly  with  the  complexity  of  the  scene.  It  should  be  noted  that  each  extension 
is  developed  under  the  assumption  that  only  one  variation,  for  example  scaled, 
curved,  or  movable  object,  occurs  at  a  time.  Future  extension  will  be  to  the  cases 
where  more  than  one  variation  in  object  is  allowed.  It  will  be  interesting  to  see  how 
the  techniques  developed  for  individual  extensions  can  be  combined  and  how  the 
performance  degrades.  Another  direction  is  to  explore  additional  constraints  that 
capture  characteristics  of  objects  in  these  situations. 

The performance of RAF is greatly improved by the use of extended features such
as edges for 3D samples. Point-like data are shown to be inefficient when objects
are occluded. The initial segmentation and grouping to obtain extended features
drastically reduce the size of the search space. Another technique is the use of Hough
clustering to prefilter candidates for hypotheses so that the attention of the system
can focus on “good” ones. This method helps reduce the search time, though at
the expense of introducing additional matching errors.

The introduction of “null branches” exponentially increases the number of feasible
interpretations generated and therefore greatly increases the complexity of the
model test. Heuristic search ordering is used to guide the search towards “good”
interpretations by a measure of quality. The search is cut off as soon as the measure
reaches an acceptable level. This heuristic technique, the premature termination
of the search, is essential in reducing the search time.

In the experiments, a simple ordering heuristic picks more distant points first
and puts them at the early stage of the pairing process. This puts the most effective
constraints at the beginning and results in pruning out entire subtrees at as early a
stage in the tree generation as possible. The interesting question is, when different
types of constraints become available, in what order the constraints should be applied
to most effectively prune the tree search. The key is to have a metric for weighing
constraints according to their power in constraining the search.

4 The Hough clustering method used here is a variation of classical ones for finding an object's pose. A
Hough transform accumulates evidence for coordinate transformations in a parameter space whose axes
are the quantised transformation parameters. Large clusters of similar transformations in that space are
taken as evidence of a correct match. For details please refer to [37].
The results show that the execution time grows rapidly with the complexity of each
model, as well as with the total number of models. The performance of the algorithm
depends on the error in the measurement of sensed data points (such as positional
error of points and angular error of surface normals). Partial symmetries in the
objects often lead to multiple interpretations and account for increased matching
time and most of the matching errors. Additional constraints from a careful
symmetry analysis may resolve the ambiguities. The interpretation is also sensitive to the
sensing strategy, such as the spacing between grid points. Exploring sensing
strategies to disambiguate multiple interpretations is considered in [32].

2.  Local  constraints: 

RAF uses only local geometric constraints. This is in line with the assumption
that only sparse data are available, as with tactile sensors. Simple
constraints, such as distance, angle, and direction constraints, prove to be effective
for isolated objects. However, overlapping objects increase the complexity
dramatically. Other types of constraints can potentially be used if they can be obtained
or derived readily from measurements. Grimson [36, 39] gives a formal
combinatorial analysis of the complexity of recognition algorithms based on searching the
interpretation tree, under some probabilistic assumptions on the data distribution.
He shows that the expected complexity is quadratic in the number of features for
isolated objects, and can be reduced to polynomial time in cluttered environments
using Hough transforms and premature termination.

It is interesting to notice that more complicated constraints, like the coupled
constraints discussed in [33], do not significantly reduce the size of the search, and instead
increase the search time. Two reasons account for this result: the coupled constraints
do not offer more information than simple constraints, and they are more expensive to
compute. The introduction of new features and constraints should take into
consideration the computational aspects of features, such as how easily and reliably they can
be computed.

3.  Heuristics: 

The use of two heuristic techniques, the Hough transform and premature
termination, reduces the size of the search. Since the Hough transform focuses matches on
“good” candidates, the branching factor of the interpretation tree is cut down. The
sensitivity of Hough transforms is discussed in a later paper by Grimson [37].
Premature termination guides the search to maximize some measure, in this case
the sum of the areas of the faces explored in an interpretation, and terminates
the search early at some threshold. The heuristic of maximizing area is a patch
for the complexity explosion due to the introduction of null nodes. Premature
termination imposes a limit on the height of the interpretation tree by
thresholding and therefore reduces the length of the search paths. However, a search with
a poorly chosen threshold can miss the best interpretations or accept a “good”
interpretation when a better one exists. The tradeoff here is efficiency versus quality of
interpretation. Thus the choice of an appropriate measure of matching quality and
an appropriate threshold to terminate the search is the key. Grimson [38] shows that
the optimal threshold can be found, again under some assumptions on the data
distribution.
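
The prefiltering role of Hough clustering can be sketched as a voting scheme over quantized pose parameters. This is an illustrative assumption rather than the variant used in RAF: poses are restricted to 2D (theta, tx, ty), the bucket sizes are arbitrary, the candidate list is assumed to be non-empty, and the name hough_cluster is hypothetical.

```python
from collections import Counter
import math

def hough_cluster(candidates, angle_step=math.radians(10), shift_step=5.0):
    """Vote candidate 2D poses (theta, tx, ty) into quantized buckets.

    Returns the most populated bucket and its vote count. Each candidate
    pose is assumed to come from one tentative data/model pairing.
    """
    votes = Counter()
    for theta, tx, ty in candidates:
        bucket = (round(theta / angle_step),
                  round(tx / shift_step),
                  round(ty / shift_step))
        votes[bucket] += 1
    return votes.most_common(1)[0]
```

Only the pairings that voted for a large bucket would then be passed to the tree search, which is how the branching factor is reduced. The coarse quantization is also the source of the additional matching errors noted above, since nearby poses can fall into different buckets.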

4.  Further  improvement: 

In  addition  to  what  we  have  discussed  above,  the  performance  of  RAF  can  be  further 
enhanced  by 

• Grouping: Initial grouping of features, such as edges, that likely come from the
same object can further reduce the size of the search and improve the accuracy
and robustness of the interpretation. Jacobs [64] has shown that a speedup of
several orders of magnitude can be expected by using some simple grouping
techniques.

•  Search  attention  focus:  Techniques  similar  to  the  Hough  transforms  can  be 
used  to  filter  candidates  for  matches  so  that  the  “seeds”  left  can  direct  and 
constrain  the  search.  More  distinctive  features  can  also  be  used,  if  we  can  find 
them,  to  focus  matches  on  a  smaller  set  of  features. 

•  Use  of  more  model  knowledge  to  constrain  the  search:  Additional  knowledge 
of  the  models,  for  example  global  features,  can  be  used  to  further  constrain 
the  search.  At  the  hypotheses  generation  stage  more  model  knowledge  can  be 
assumed.  Incorrect  hypotheses  can  be  rejected  at  the  verification  stage. 

•  Hierarchical  representation  with  indexing: 

For  objects  that  are  curved,  the  number  of  edge  segments  used  to  approximate 
the  objects  is  large.  Hierarchical  representation  such  as  a  “strip  tree”  can  be 
used  [28]. 

•  Iterative  transformation  computation: 

RAF  uses  a  few  data  points  to  estimate  the  model  to  scene  transformation. 
An  average  over  many  individually  computed  transformations  is  used  to  reduce 
the  effect  of  errors.  A  better  method  would  iteratively  improve  the  estimate  of 
the  transformation  when  more  data  are  available.  The  initial  estimate  of  the 
transformation  can  even  be  used  to  prune  the  tree  search  at  the  hypothesizing 
stage.  Therefore  the  hypothesizing  stage  is  interleaved  with  the  verification 
stage.  At  each  iteration  features  are  predicted,  subject  to  the  local  constraints 
such  as  those  of  RAF.  A  partial  verification  checks  the  prediction  using  the 
estimate  of  the  transformation.  The  result  of  the  partial  verification  is  then 
used  to  refine  the  estimate  of  the  transformation.  However,  presently  rotations 
in  RAF  are  represented  in  terms  of  angle/axis  and  orthonormal  matrices  that 
do  not  lead  to  a  simple  iterative  refinement.  Recomputing  the  transformation 
at each iteration would be too expensive. The quaternion representation for
rotations of Faugeras [25] is an elegant alternative that leads to an iterative
least-squares method.
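
The appeal of the quaternion representation can be illustrated with a short sketch. It is not the least-squares method of [25]; it only shows that a rotation is carried by four numbers with a single unit-norm constraint, so an incrementally updated estimate can be repaired by one normalization instead of re-orthonormalizing a matrix. All function names are hypothetical.

```python
import math

def normalize(q):
    """Scale a tuple to unit length (restores the unit-norm constraint)."""
    norm = math.sqrt(sum(c * c for c in q))
    return tuple(c / norm for c in q)

def cross(a, b):
    """Cross product of two 3D vectors."""
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])

def axis_angle_to_quaternion(axis, angle):
    """Unit quaternion (w, x, y, z) for a rotation of `angle` about `axis`."""
    ax = normalize(axis)
    s = math.sin(angle / 2.0)
    return (math.cos(angle / 2.0), ax[0] * s, ax[1] * s, ax[2] * s)

def rotate(q, v):
    """Rotate 3D vector v by unit quaternion q using v + 2w(u x v) + 2u x (u x v)."""
    w, u = q[0], q[1:]
    t = cross(u, v)
    tt = cross(u, t)
    return tuple(v[i] + 2.0 * w * t[i] + 2.0 * tt[i] for i in range(3))
```

As a check, rotating (1, 0, 0) by 90 degrees about the z axis yields (0, 1, 0) up to rounding.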


3.3  Ikeuchi et al.


3.3.1  Overview 

The approach of Ikeuchi et al. at CMU [60] deals with automatic generation of object
recognition  programs.  It  is  one  of  a  few  model-based  recognition  systems  that  model 
sensors,  in  addition  to  geometries  of  objects.  Ikeuchi  et  al.  discussed  issues  and  techniques 
for  automatic  generation  of  recognition  programs  by  compilation,  based  on  a  method  in 
an  earlier  paper  [59].  To  evaluate  the  approach  of  Ikeuchi  et  al.,  we  ask  the  following 
questions:  (1)  what  is  its  intended  domain?  (2)  how  well  does  it  work?  (3)  how  can  the 
method  be  generalized  to  handle  more  complex  situations?  More  specifically,  we  discuss 
how  their  method  models  objects  and  sensors  and  how  noise  is  treated. 

The system of Ikeuchi et al. works in bin-picking task domains, using 3D information
including depth maps. It assumes that only instances of a single object model, possibly
jumbled, are present in the image. It does not handle occlusions, since it assumes that it
can always pick the top object first. The task of the system is to automatically generate,
at compile time, recognition programs that determine the precise position and orientation
of an object. If general model-based recognition consists of hypothesizing and
verification (model test), then this system only handles the model test part that produces
the transformation. On the other hand, Ikeuchi et al. present a complete methodology of
programming for vision recognition.

Historically, edge-based approaches have attacked bin-picking problems by focusing on
brightness changes and working on the resulting binary images. These algorithms work
well for 2D objects on a flat table with a well-defined background. However, recognition of
more complicated scenes containing 3D objects requires more knowledge of the image and
the objects. Ikeuchi et al. treat the task as the extraction of useful features and the control
of how these features are used to recognize objects. They present a method for automatically
compiling  object  and  sensor  models  into  recognition  programs,  based  on  a  careful  analysis 
of  the  object  models,  the  sensor  models,  and  the  characteristics  of  their  interplay  in  the 
scene. 

The  system  of  Ikeuchi  et  al.  consists  of  the  following  key  steps: 

1.  object  modeling  —  geometric  and  photometric  properties; 

2.  sensor  modeling  —  sensor  characteristics  and  variations  of  feature  values  in  terms 
of  detectability  and  reliability; 

3.  prediction  of  appearances  —  aspects; 

4.  strategy  generation  —  interpretation  tree; 

5.  program  generation  —  object-oriented  programming. 

The object model is assumed to be polyhedral. First all the possible appearances of
the object are enumerated. Ikeuchi et al. use aspects and features associated with each
aspect as intermediate representations. An aspect is defined as a topologically equivalent
class of appearances of an object. Shape changes between aspects, the aspect changes, are
nonlinear. Shape changes within an aspect are called linear changes. All the appearances
of an object model are classified into different aspects. The expected values of features for
each aspect are predicted and used to discriminate different aspects. The linear change is
then determined by a transformation from the aspect of the object to the appearance of
the object.

An  interpretation  tree  classifies  an  appearance  of  an  object  into  one  of  its  aspects  that 
correspond  to  the  leaves  of  the  tree  and  determines  the  linear  change  within  the  aspect. 
Figure  5  shows  an  interpretation  tree  generated  for  a  simple  industrial  part.  The  root  of 
the  tree  represents  all  the  aspects  of  the  object  while  the  leaves  are  five  distinct  aspects 
of the object, S1, S2, S3, S4, and S5. Features such as inertia and local geometries are
used  to  discriminate  aspects  at  each  node.  Within  each  aspect  class,  i.e.,  within  each 
leaf  node,  linear  changes  are  then  determined  using  the  EGI5  and  other  features  in  the 
intermediate  coordinate  systems  —  the  face  and  edge  feature  coordinate  systems.  Thus 
the  interpretation  tree  represents  a  recognition  strategy.  Applying  an  interpretation  tree 
to  an  appearance  of  the  object  in  the  scene  classifies  the  appearance  into  one  of  the  aspects 
and  then  determines  its  precise  position  and  orientation  within  the  aspect. 
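
The classification half of such a strategy can be sketched as a small decision tree. The tree below is invented for illustration and is not the tree of Figure 5: the feature names, thresholds, and the function classify are hypothetical, and the determination of the linear change within the chosen aspect is omitted.

```python
# Hypothetical interpretation tree: an internal node is a tuple
# (feature_name, threshold, left_subtree, right_subtree) and a leaf is the
# name of an aspect. Feature names and thresholds are made up.
TREE = ("inertia", 0.5,
        ("num_faces", 2.5, "S1", "S2"),
        ("inertia", 0.8, "S3", ("num_faces", 3.5, "S4", "S5")))

def classify(tree, features):
    """Walk from the root to a leaf: go left if feature < threshold, else right."""
    while not isinstance(tree, str):
        name, threshold, left, right = tree
        tree = left if features[name] < threshold else right
    return tree
```

Because the order of feature tests is fixed when the tree is built, the runtime cost is one short path from the root to a leaf, at the price of a search path that cannot adapt to unexpected measurements.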

The aspects and their predicted features are view-dependent. The runtime efficiency is
improved by enumerating all the possible variations of features due to variations in viewing
directions, for example aspects, and predetermining the search path in the interpretation
tree. Sensors and objects are explicitly modeled. Features are quantitatively modeled
in terms of detectability and reliability6, as opposed to other recognition systems, for
example RAF, that only use “hard numbers” (step intervals) to model uncertainties of
features. The feature values for the scene object are obtained from three maps: the needle
map (for surface normals), the edge map, and the depth map. Dual photometric stereo
can be used to produce the maps.

The  interpretation  tree  is  then  compiled  into  an  executable  program.  The  compiled 
special-purpose  recognition  program  is  for  a  single  object  model  only  and  can  be  used  to 
classify instances of the object in applications such as bin-picking tasks. Object and sensor
modeling  and  analysis  of  the  best  recognition  strategy  are  done  at  compile  time.  This 
partially  alleviates  the  repeated  development  efforts  in  the  traditional  one  system/one 
application  practice. 

5 The EGI (Extended Gaussian Image) can be used to achieve certain tasks such as constraining the
viewing directions and rotations at the stage of determining linear change. It has the advantage of being
invariant under translation and scaling of axes of the coordinate system. It also rotates in the same way
as the object it represents. However, it is less powerful in cases of occlusions [46, 59].

6 Sensor detectability specifies under what conditions a sensor can detect a feature, while sensor
reliability is a measure of confidence for the detected features over the detectable configurations.
Figure 5: Interpretation Tree (from Ikeuchi et al.)

When  applying  the  recognition  program  to  a  scene,  the  highest  object  instance  of  the 
scene  is  first  chosen.  This  is  designed  to  avoid  complications  due  to  partial  occlusions  by 
nearby  instances.  After  the  localization  of  the  highest  object,  the  object  is  picked  up  from 
the bin. Recognition then proceeds to the next highest object. This sequence of recognition
and action is natural for bin-picking tasks.


3.3.2  Evaluation 

We can view the approach as being composed of two steps: classifying nonlinear shape
changes (aspect changes) and determining linear shape changes within an aspect. In the
first step the search is guided by focusing on aspects only at the coarser level. The classified
aspects then serve as an intermediate representation, and a relatively cheap determination
of linear changes proceeds. Ikeuchi et al. consider only three degrees of freedom in the
linear change: two degrees of freedom in translation on the image plane and one degree
of freedom in rotation around the optical axis. The camera model assumes orthographic
projection. Since a rough estimate of the orientation of the object has already been obtained
by classifying aspect changes, the rotations of edges within an aspect are more sensitive
around the viewing direction than around axes within the image plane. However, we still
need the depth information for picking the object from the bin. A depth map is used for
resolving this problem.
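
To make the three degrees of freedom concrete, the sketch below estimates an in-plane rotation and translation from matched 2D points by least squares. It is a generic closed-form fit under stated assumptions (known point correspondences, orthographic projection, no scale change), not the procedure of Ikeuchi et al., who use the EGI and features in the face and edge coordinate systems. The function name estimate_linear_change is hypothetical.

```python
import math

def estimate_linear_change(model_pts, image_pts):
    """Least-squares rotation theta and translation (tx, ty) in the image plane.

    Finds the rigid 2D motion that best maps model_pts onto image_pts,
    where the two lists are matched point by point.
    """
    n = len(model_pts)
    mcx = sum(p[0] for p in model_pts) / n
    mcy = sum(p[1] for p in model_pts) / n
    icx = sum(p[0] for p in image_pts) / n
    icy = sum(p[1] for p in image_pts) / n
    s_cos = s_sin = 0.0
    for (mx, my), (ix, iy) in zip(model_pts, image_pts):
        ax, ay = mx - mcx, my - mcy      # centered model point
        bx, by = ix - icx, iy - icy      # centered image point
        s_cos += ax * bx + ay * by       # accumulates the cosine term
        s_sin += ax * by - ay * bx       # accumulates the sine term
    theta = math.atan2(s_sin, s_cos)
    c, s = math.cos(theta), math.sin(theta)
    # The translation carries the rotated model centroid onto the image centroid.
    tx = icx - (c * mcx - s * mcy)
    ty = icy - (s * mcx + c * mcy)
    return theta, tx, ty
```

The remaining degrees of freedom are not recovered by this fit; in the system described above they are fixed by the aspect classification and the depth map.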

Although the scheme works well for the task at hand, bin-picking of relatively simple
industrial parts, it does not easily generalize to more general tasks that handle realistic
scenes, aside from the techniques for automatically compiling recognition programs. The
system assumes that the linear changes within an aspect can be determined cheaply and
easily. This assumption restricts the class of objects it can handle to polyhedral
objects. The polyhedral approximations of smooth objects are not stable: their expected
feature values are sensitive to small variations in viewing direction. For instance, the
polygonal approximation to a circle is sensitive to small variations in the image. A slight
rotation of the approximating polygon results in drastic errors in determining the
transformation [34]. Most real-world objects display complex shapes whose nonlinear shape
changes cannot be easily classified into aspects. Although polyhedral approximation to
curved objects can sometimes be used with the aid of feature reliability, it nevertheless
introduces matching errors.

The tradeoff between runtime speed and storage space is evident in
Ikeuchi et al.'s decision to precompile the model and sensor features and the recognition
strategy into an executable program and merely apply the program at runtime, as
opposed to invoking the features and constraints and the analysis of how to use them at
runtime. The size of the interpretation tree can potentially be very large. In the worst
case, the size is exponential in the number of distinct aspects of the object model, which
can be large for objects that lack planar faces. Therefore a judicious choice of features is
important. A cutoff in the enumeration of appearances is used, with some sacrifice in
accuracy.

Precompiling constraints into a tree also predetermines the search path. The method
is sensitive to small variations in the geometries of objects and sensors, for example defects
in objects due to manufacturing processes. Although Ikeuchi et al. model some of the
effect with feature reliability, this trades off some runtime flexibility. For bin-picking
tasks, starting recognition at the highest instance does reduce the possibility that part of
it is occluded by other instances. Future research should generalize this method to handle
occlusions due to different objects in general scenes. Backtracking in the interpretation
tree search is needed to handle such occlusions.

3.4  Related  work 

This section surveys work related to the three approaches we have just discussed. One
group deals with line drawing interpretation, where the paradigm of consistent labeling
has its roots. The others either are similar to the three approaches or represent slightly
different approaches in the area of model-based recognition.

3.4.1 Interpreting line drawings

1.  Guzman: 

Guzman [41] started the systematic work on interpreting polyhedral line drawings.
The task is to partition a line drawing into distinct bodies of objects.
The pictures are assumed to contain polyhedral objects with trihedral vertices only.
The reason for choosing the polyhedral world was that it possesses some essential
aspects of scene analysis (image formation, models, spatial relationships of scene
objects: occlusion, shadowing, support, etc.), and yet is simple enough to admit
computationally tractable analysis at the time (around the 60’s and early 70’s) [87].
The approach presents a theory about how to place links between pairs of regions
that belong to a single object body, based on an analysis of junction formations.
Given a picture, it first labels local pairs of regions according to the theory. Then
the local evidence is grouped to produce a legal interpretation. This works well on
many examples, but it fails on some simple scenes. To patch up the holes, additional
heuristic rules are added. The approach claims to use no knowledge of prototypical
bodies. The basic flaw of Guzman’s method is that it was necessary to add and
modify rules to handle counterexamples.

2.  Huffman  &  Clowes: 

Huffman  and  Clowes  [49,  20]  suggested  that  the  remedy  to  Guzman’s  scheme  comes 
from  a  complete  analysis  of  the  junction  types.  Edges  forming  a  junction  can  be 
labeled  as  one  of  convex,  concave,  or  occluding  edges.  Since  there  are  only  a  small 
number  of  legal  labelings  for  each  junction  type,  all  the  legal  labelings  of  junctions 
can  be  enumerated  to  form  a  catalogue  of  legal  labels.  Each  edge  bridges  two 
junctions,  if  both  of  them  are  visible.  The  two  labels  at  the  ends  of  the  edge 
have  to  be  consistent.  Interpretation  is  merely  a  tree  search  that  backtracks  when 
it comes to a label that is inconsistent with what is already known. In a later
paper Huffman [51] generalizes the idea to handle polyhedral objects with an arbitrary
number of edges forming a junction, using decision criteria in a dual space to check
for the legality of a label.

3.  Waltz: 

Waltz [85] extended the semantic catalogue of Huffman/Clowes to handle shadows.
He presented an algorithm that first locally eliminates all the illegal labels at each
junction by checking pairwise consistency across each edge, and then proceeds to
do a global search. The remarkable thing about Waltz’s algorithm is that the
filtering pass (the first step) eliminates almost all the impossible interpretations and
achieves drastic speedup. The approach relies on two basic components: a catalogue
of semantic labels and combination rules, and a single mechanism, constraint
propagation, that uses the catalogue for consistent labeling. He thus showed that
the interpretation of a scene is easy if we can come up with a dictionary of primitives
(“words”) and a grammar to put them together to produce legal “sentences”.
The Waltz algorithm shows that the more we know about the world, the smaller the
search space is. If we can discover and represent all the intrinsic constraints of some
complex objects, we should be able to apply the algorithm in much the same way
as in the polyhedral world. For example, curved objects offer rich information such
as local curvatures. Careful analysis should lead to a successful program to handle
general (smooth) surfaces [18]. Brady [10] also presents a primal surface sketch to
describe significant changes such as steps, roofs, and ridges.

4.  Other  extensions: 

Later  research  has  extended  Waltz’s  work  in  several  directions:  Sugihara  [79]  looks 
for  new  sources  of  constraints  such  as  coplanarity  of  vertices  and  edges  that  lie  in 
the  same  face  of  the  polyhedral  object.  Mackworth  [66]  and  Draper  [22]  work  in 
the  gradient  space  and  explore  positions  and  orientations  of  edges  and  surfaces. 
Chakravarty  [18]  and  Malik  [68]  generalized  the  method  to  handle  curved  objects. 

5.  Remarks: 

It has been shown that there are picture configurations that are locally consistent
but globally inconsistent [49]. The reason seems to be that the local constraints of
corner sorts are incapable of expressing global constraints [67]. One way to overcome
this is to explore stronger local constraints. Nguyen [74] showed that by exploring
local topological constraints local consistency implies global consistency. Another
method is to explore some degree of global constraints. The knowledge of models
can be used to eliminate globally inconsistent interpretations. This naturally leads
to model-based recognition that hypothesizes a match to a model and then verifies
the match.

One  important  lesson  we  learned  from  the  work  on  line  drawing  interpretation  is  that 
in  order  to  improve  the  interpretation  of  a  scene  we  should  first  search  for  better 
descriptions  of  the  scene  and  model  features,  and  then  look  for  more  evidence. 
Throwing  in  ad  hoc  heuristics  to  deal  with  counterexamples  should  only  be  a  last 
resort  [87]. 
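
The filtering pass of Waltz's algorithm described above can be illustrated with a minimal sketch. It assumes a toy catalogue of junction labelings invented for the example (it is not the Huffman/Clowes or Waltz catalogue), and the names `waltz_filter`, `candidates`, and `edges` are hypothetical. A labeling of a junction is discarded whenever the junction at the other end of a shared edge has no remaining labeling that gives that edge the same label; the pass repeats until nothing more can be removed.

```python
def waltz_filter(candidates, edges):
    """Discard junction labelings that no neighbouring junction can agree with.

    candidates: dict mapping junction name -> list of labelings, where a
                labeling is a dict mapping edge name -> edge label.
    edges:      dict mapping edge name -> (junction_a, junction_b).
    """
    changed = True
    while changed:  # repeat until a full pass removes nothing
        changed = False
        for edge, (a, b) in edges.items():
            for src, dst in ((a, b), (b, a)):
                # Labels the neighbouring junction still allows on this edge.
                supported = {labeling[edge] for labeling in candidates[dst]}
                kept = [lab for lab in candidates[src] if lab[edge] in supported]
                if len(kept) < len(candidates[src]):
                    candidates[src] = kept
                    changed = True
    return candidates


# Toy catalogue (an assumption for illustration only): three junctions
# joined in a triangle by edges e1, e2, e3, each with two candidate labelings.
candidates = {
    "A": [{"e1": "+", "e3": "+"}, {"e1": "-", "e3": ">"}],
    "B": [{"e1": "+", "e2": "-"}, {"e1": "+", "e2": "+"}],
    "C": [{"e2": "+", "e3": "+"}, {"e2": ">", "e3": ">"}],
}
edges = {"e1": ("A", "B"), "e2": ("B", "C"), "e3": ("A", "C")}

filtered = waltz_filter(candidates, edges)
for junction, labelings in filtered.items():
    print(junction, labelings)
```

In this toy case filtering alone leaves one labeling per junction, so no global search is needed. In general the surviving labelings are only pairwise consistent, and a tree search over them is still required.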


3.4.2  Searching  a  tree  to  match  model  and  image  features 
1.  Faugeras  &  Hebert: 

Faugeras and Hebert [25] presented a 3-D recognition and positioning algorithm
using geometric matching between primitive surfaces. The method (1) estimates
rotation and translation (when the transformation admits decomposition) by pairing
primitive surfaces to those of models, and then (2) iteratively refines the estimate
using a least-squares method that minimizes a measure of matching quality. Rotations
are represented as quaternions. Recognition is a tree search procedure. The search
is pruned by a local consistency measure that precludes strongly inconsistent pairs
from consideration.

2. HYPER⁷ (Ayache & Faugeras):

HYPER [26] identifies and locates objects by generating and verifying hypotheses,
coupled with a recursive estimation of the model-to-scene transformation. It handles
2D objects on a flat surface and is robust to partial occlusions, shadows, and scalings.
The system has been successfully tested in a large number of different scenes
containing partially overlapping industrial parts. Local and compact descriptions
are used for both the models and the scenes. Hypotheses that
estimate the model-to-scene transformation are generated by matching a few privileged segments
of the model description to those in the scene, and are ranked according to a measure of quality.
Only the best hypotheses are evaluated. The transformation is then evaluated
by matching additional segments and recursively updating the parameters of the
transformation using a least-squares method (Kalman filter). The privileged segments
serve to focus the search onto a small number of distinct features. The
hypothesized match constrains the search relative to the model. This is similar to
the alignment method [83].

3.  Local-feature-focus  Method: 

This method [7] uses local features such as regions and corners of 2D parts to
direct the generation of hypotheses. It finds one feature, the focus feature, and
uses it to predict a few nearby features to look for. A graph-matching technique is
used to identify the largest cluster of image features matching a cluster of object
model features. The matching is solved as a maximal-clique problem, which is
computationally expensive. However, the local-feature-focus method prunes the tree search by
cutting down the branching factor of the tree. The key to this method is to select
the best focus features and the most useful nearby features, through the ranking of
features, so that only the best hypotheses are generated and the time spent in the (expensive)
verification step is minimized. The selection of focus features is done automatically.

4. 3DPO:

The task of 3DPO [8] is bin-picking of 3D objects, possibly occluded. Range
data are available as input. This method generates hypotheses and matches them
against image features. Search is pruned by checking the consistency of features. A few features
or clusters of features are automatically selected from a CAD model based on a
feature representation and preliminary planning. The image is clustered into
view-independent clusters of features. Hypotheses are generated based on the features
detected, possibly with the help of the local-feature-focus method.

⁷ HYPER stands for HYpotheses Predicted and Evaluated Recursively.

5. Grouping of features (Lowe and Jacobs):

Lowe [65] presented a method to recognize 3D objects from 2D images. The method
(1) forms groupings and structures in images; (2) uses a probabilistic ranking to
reduce the size of search during matching; and (3) spatially matches the 2D image to the
model by determining a transformation. The groups in the image bridge the gap
between the 2D images and 3D models.

Jacobs [64] showed that initial grouping of features, such as edges, that are likely to come
from the same object can reduce the size of the search and improve the accuracy of
interpretations. He uses two types of geometric constraints to form groupings of
edges: the distance between two edges and their relative orientation. The method
reduces computation by a factor of between 100 and 1500 relative to an
identical system that does not use grouping.

6.  Ikeuchi  &  Horn: 

Ikeuchi, Horn, et al. [57] solved the bin-picking of 3D objects by matching the surface
normal distribution of the image object against that of the model. The surface normal
distribution of the scene is calculated by the photometric stereo method. The scene
is then segmented into isolated regions using the surface normal distribution.
Finally the method determines the object orientation by matching the surface normal
distribution of the target region against that of the known object model using the
discrete Extended Gaussian Image.

7.  Baird: 

Baird  [1]  equates  consistent  matches  with  constraints  in  some  geometric  space  and 
uses  a  linear  programming  method  to  solve  this  geometric  problem. 
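
The maximal-clique formulation mentioned for the local-feature-focus method can be sketched as follows. This is an illustration under stated assumptions, not the implementation of [7]: features are reduced to 2D points, two pairings are taken to be compatible when they preserve the inter-feature distance within a tolerance, and the largest clique is found with a plain Bron-Kerbosch enumeration. The function names and the sample coordinates are hypothetical.

```python
from math import dist


def compatible(p, q, image, model, tol):
    """Two (image, model) pairings agree if they preserve the pairwise distance."""
    (i1, m1), (i2, m2) = p, q
    if i1 == i2 or m1 == m2:
        return False  # a feature may be used only once
    return abs(dist(image[i1], image[i2]) - dist(model[m1], model[m2])) <= tol


def largest_clique(nodes, adj):
    """Enumerate maximal cliques (Bron-Kerbosch) and keep the largest one."""
    best = []

    def expand(clique, cand, excluded):
        nonlocal best
        if not cand and not excluded:
            if len(clique) > len(best):
                best = list(clique)
            return
        for v in list(cand):
            expand(clique + [v], cand & adj[v], excluded & adj[v])
            cand = cand - {v}
            excluded = excluded | {v}

    expand([], set(nodes), set())
    return best


# Hypothetical data: a scalene model triangle, its translated image, and clutter "s".
model = {"a": (0, 0), "b": (4, 0), "c": (0, 3)}
image = {"p": (10, 10), "q": (14, 10), "r": (10, 13), "s": (20, 20)}

nodes = [(i, m) for i in image for m in model]
adj = {n: {o for o in nodes if o != n and compatible(n, o, image, model, 0.01)}
       for n in nodes}
match = sorted(largest_clique(nodes, adj))
print(match)
```

The enumeration is exponential in the worst case, which is why the method relies on focus features to keep the compatibility graph small.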


3.4.3  Using  models  to  filter  matches 
Alignment  Method: 

The alignment method [83] searches directly for a possible transformation from a model
to the image object using as small a number of points as possible and discards incorrect
transformations by checking with additional points. This is another instance of the
hypothesize-and-verify paradigm. It recognizes an observed object by hypothesizing the object’s identity
together with its position and orientation and verifying the hypothesis.

The alignment method first aligns an image object with a model by a transformation.
Three distinct points or surface normals are used to determine the transformation. For
each of the models in the model library, a transformation to the image object is
determined. Then the fitness of all transformations is examined and ranked according to some
metric of quality, for example the simple distance metric. The best fit is taken to be the correct
interpretation of the observed object. It is helpful to compare the alignment method
with Grimson et al.’s RAF, which tries to find all the correct interpretations, a decision
that is purely driven by the task to accomplish. The alignment method has been generalized to
handle nonrigid transformations. Further extensions are carried out by Basri and Ullman
to recognition of objects with smooth surfaces using a curvature model, and recognition
based on linear combination of models [81, 2].
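
The hypothesize-and-verify loop of the alignment method can be sketched in a simplified planar setting. The sketch assumes a 2D affine transformation fixed by three point correspondences and a verification score that counts model points landing near some image point; the method in [83] addresses 3D objects, so this is only an analogy. All function names and coordinates are hypothetical.

```python
from math import dist


def det3(m):
    """Determinant of a 3x3 matrix given as nested lists."""
    return (m[0][0] * (m[1][1] * m[2][2] - m[1][2] * m[2][1])
            - m[0][1] * (m[1][0] * m[2][2] - m[1][2] * m[2][0])
            + m[0][2] * (m[1][0] * m[2][1] - m[1][1] * m[2][0]))


def solve3(a, rhs):
    """Solve a 3x3 linear system by Cramer's rule."""
    d = det3(a)
    if abs(d) < 1e-12:
        raise ValueError("anchor points are collinear")
    solution = []
    for col in range(3):
        m = [row[:] for row in a]
        for r in range(3):
            m[r][col] = rhs[r]
        solution.append(det3(m) / d)
    return solution


def affine_from_three(model_pts, image_pts):
    """Hypothesize: the affine map sending three model points to three image points."""
    a = [[x, y, 1.0] for x, y in model_pts]
    return (solve3(a, [p[0] for p in image_pts]),
            solve3(a, [p[1] for p in image_pts]))


def transform(t, p):
    (a, b, c), (d, e, f) = t
    return (a * p[0] + b * p[1] + c, d * p[0] + e * p[1] + f)


def verification_score(t, model, image, tol):
    """Verify: count model points that land within tol of some image point."""
    return sum(1 for m in model
               if any(dist(transform(t, m), q) <= tol for q in image))


model = [(0, 0), (2, 0), (0, 1), (2, 1), (1, 2)]
# Image = model rotated by 90 degrees and shifted by (5, 5).
image = [(5, 5), (5, 7), (4, 5), (4, 7), (3, 6)]

good = affine_from_three(model[:3], image[:3])
bad = affine_from_three(model[:3], [image[1], image[0], image[2]])
good_score = verification_score(good, model, image, 0.1)
bad_score = verification_score(bad, model, image, 0.1)
print(good_score, bad_score)
```

The wrong pairing of anchor points still explains the three anchors themselves, but the additional points expose it, which is the role the text assigns to verification.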

3.4.4 Automatic generation of recognition programs
Goad:

Goad [29] describes a method for automatically constructing special-purpose
recognition programs. The method is based on precomputing tables containing bounds on spatial
relations between selected pairs of image features for a small range of viewpoints. A search
for matches between image edges and model edges is made. Hypotheses are checked by a
table lookup. The method is claimed to be fast.


4  A  Perspective  on  Model-based  Recognition 

We  have  evaluated  the  three  approaches  of  ACRONYM,  RAF,  and  Ikeuchi  et  al.  in  the 
previous  section  and  have  discussed  the  strengths  and  weaknesses  of  each  approach.  We 
have  also  suggested  future  directions  for  improvements  in  their  performance. 

In this section we will discuss and compare the three systems in the larger context of
other approaches to model-based recognition, and place them in the comparative
framework defined by the three axes: generality, representation, and control, as shown in
Figure 6. We will identify where tradeoffs are made and analyze why they are made.

The  following  types  of  tradeoffs  made  in  the  implementations  will  be  discussed  to 
reveal  how  the  systems’  tasks  affect  the  decisions  underlying  the  tradeoffs: 


efficiency vs. generality;
efficiency vs. accuracy;
efficiency vs. runtime flexibility;
efficiency vs. storage space.


From the three systems under study and other work we can see a constant battle
to achieve a certain level of performance without too much sacrifice in generality. The issues of
representation and control have been raised again and again. The ACRONYM system
is the most complex system. It is one of the few implemented recognition systems that
are intended to be general and domain-independent, and it has been tested on some aerial
images. However, it would be difficult for the general constraint manipulation system to
compute constraints on object parameters from an unconstrained viewpoint. ACRONYM
further constrains the parameter ranges of models by restricting the range of viewpoints
when it is applied to interpret real images, for example the aerial images of airfields.
Binford noted in [6], “there is no profound reason why ACRONYM could not recognize
aircraft in images taken at ground level, although it will probably break when tested on
such images because of bugs or missing capabilities that were not exercised previously.”
Although ACRONYM is intended to be a general system, in order to work on real images
many tradeoffs have been made along the way from its general theory to the working
implementation. The complexity of the implementation of ACRONYM makes it very
hard to test other ideas that the designers had in mind. This fact motivates the need
for reasonable computational efficiency so that various ideas of a design can be tested
and improved afterwards. Waltz’s algorithm, for example, is very fast in interpreting line
drawings. Waltz [85] noted that because of the speed of his algorithm he was “able to
test the program on many separate line drawings” and had “been able to gain a clearer
understanding of the capabilities and ultimate limitations of the program.”

Figure 6: The comparative framework for different approaches

RAF improves performance by specializing its domain. RAF works in a robot
sensing environment where accuracy and speed are the main objectives. Its input consists
of a sparse set of data including positions and surface normals, which generates a smaller
interpretation tree than dense data would. RAF tends to generate all the possible
interpretations on the belief that false negatives are better than false positives (false
positives are hard to get rid of).

Ikeuchi et al. presented a method to automatically generate a recognition strategy
and compile it into a program. Although the principles of their system are general, their
ability to work out their theory relies on the fact that appearances of an object can
be enumerated. This is true for simple industrial parts in most bin-picking tasks.
The efficiency of the method comes from the precompilation into a table of constraints
and their use in recognition, based on a decomposition of transformations into nonlinear
changes and linear changes. More complicated objects do not lend themselves to such a
clean separation of changes in appearances. It will be an interesting research issue to see
how this method handles multiple models and occlusions.

RAF is specialized by exploiting model-specific knowledge at compile time. Its
constraints are all precompiled into tables so that at runtime only table lookup is used to
obtain constraints. It is fast. In contrast, ACRONYM uses a general constraint
manipulation system to reason about constraints at runtime. As a result ACRONYM is more
robust but is slower. Ikeuchi et al.’s approach goes one step further by precompiling into a
table the constraints and the use of the constraints in discriminating hypotheses of object
attitude.
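
The difference between compile-time and runtime work can be made concrete with a small sketch of an interpretation-tree search driven by a precompiled table. It is a simplification under stated assumptions and not RAF itself: features are 2D points, the only constraint is the pairwise distance, every data point must be matched (no wildcard branch for spurious data), and the names `compile_distance_table` and `interpretations` are hypothetical.

```python
from math import dist


def compile_distance_table(model):
    """Offline step: tabulate the distance between every pair of model features."""
    return {(i, j): dist(model[i], model[j])
            for i in model for j in model if i != j}


def interpretations(data, table, model_names, tol):
    """Runtime step: depth-first search of the interpretation tree.

    Level k of the tree assigns a model feature to data[k]. A branch is
    pruned as soon as one pairwise distance disagrees with the table.
    """
    results = []

    def extend(assignment):
        k = len(assignment)
        if k == len(data):
            results.append(tuple(assignment))
            return
        for m in model_names:
            if m in assignment:
                continue
            consistent = all(
                abs(dist(data[k], data[j]) - table[(m, assignment[j])]) <= tol
                for j in range(k))
            if consistent:
                extend(assignment + [m])

    extend([])
    return results


# Hypothetical model and sensed points (model features b, c, a shifted by (10, 10)).
model = {"a": (0, 0), "b": (4, 0), "c": (0, 3), "d": (6, 5)}
data = [(14, 10), (10, 13), (10, 10)]

table = compile_distance_table(model)  # done once, before any sensing
found = interpretations(data, table, list(model), tol=0.1)
print(found)
```

At runtime the search only compares measured distances with table entries; all reasoning about the model geometry has already been done when the table was built.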

Both ACRONYM and RAF use coordinate-frame independent, i.e., sensing geometry
independent, representations for constraints. In ACRONYM, object shapes are
hierarchically decomposed into parts which are linked together by constraints in terms of algebraic
inequalities. The constraints of RAF characterize local geometry of object shapes. Ikeuchi
et al.’s method, however, adopts a viewer-centered representation (i.e., aspect
representation) by enumerating all distinct aspects of an object. Errors are explicitly modeled.
In general, the constraints should model errors explicitly and be coordinate-frame free, in
addition to being easy to compute and robust to noise [40]. As a result, the constraints
are invariant to viewing directions. A quantitative estimate of feature detectability and
reliability can be obtained.

The extent of model knowledge used in controlling the search varies among different
approaches. Ullman’s alignment method aligns a model with the observed object by a
few points and uses the model to constrain the match and verification afterwards. The
constraints are thus global. 3DPO uses a few distinctive features to find the
transformation and verifies it with additional features. Therefore the search is directed to distinct
features at the beginning of the search. The few distinctive features are selected based on
the knowledge of models. RAF uses only local geometric constraints to prime the search.
Ikeuchi et al. use the model appearances to determine the transformation.

Globally computing transformations from models to scene is more expensive than
locally checking for consistency between respective model features and image features.
It is no surprise to see that the alignment method is somewhat slower than RAF [40].
Preliminary grouping of features into extended features which provide stronger constraints
will improve the quality of hypotheses. Model verification is therefore cheaper. RAF uses
the extended features (edges) to reduce the need for search. The use of edges, as opposed
to points, cuts down both the branching factor and the height of the interpretation tree.
ACRONYM uses an edge-linking algorithm to group edge segments together.

The constraints used in ACRONYM and RAF are local geometric constraints.
Ikeuchi et al. use global properties such as moments and the EGI (Extended
Gaussian Image) as well as local properties like edges and regions.

Global features, such as area, perimeter, moments, and elongation, commonly used
in classical pattern recognition tasks for industrial parts, are sensitive to occlusions and
illumination. They are also more expensive to obtain. However, they are less
susceptible to variations in local object parameters. In many cases, global features offer tighter
constraints on the search. Grimson [30] showed that using the constraint of more distant
pairs of points at the beginning of the tree search most effectively constrains the size of
the interpretation tree generated. On the other hand, local features are invariant to global
changes such as occlusions and shadows, at the expense of being weak as constraints. If
recognition of an object is from a known viewpoint, then the appearance of the object is easy
to predict (computationally easy and cheap). Some global features and constraints can be
accurately predicted. If recognition of an object is from an unknown viewpoint, then, unless
we have ways of enumerating all the possible view directions (as in Ikeuchi et al.’s use of
aspects), local constraints are more applicable.

Dual space representations of images make the implicit properties of the original images
explicit. Examples of dual spaces include the gradient space representation of points,
edges, and faces [50] and the EGI (Extended Gaussian Image) [46] for surface orientations.
As a result, certain types of analysis are very easy in the dual spaces. It is interesting to
note that some dual space representations can convert global features into local ones. For
example, the global periodicity of a time signal is local in the frequency domain under the
Fourier transform.
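
The Fourier example can be checked numerically with a short sketch. It uses a direct discrete Fourier transform written out in plain Python; the signal length and frequency are arbitrary choices made for illustration. A cosine that repeats four times across the whole window, a global property in the time domain, shows up as a single nonzero bin in the non-negative half of the spectrum.

```python
import cmath
import math


def dft_magnitudes(signal):
    """Magnitudes of the discrete Fourier transform, computed directly."""
    n = len(signal)
    return [abs(sum(x * cmath.exp(-2j * math.pi * k * t / n)
                    for t, x in enumerate(signal)))
            for k in range(n)]


n = 32
cycles = 4  # the signal repeats 4 times over the window
signal = [math.cos(2 * math.pi * cycles * t / n) for t in range(n)]

mags = dft_magnitudes(signal)
half = range(n // 2 + 1)  # non-negative frequencies only
peak = max(half, key=lambda k: mags[k])
print(peak, round(mags[peak], 6))
```

Detecting the periodicity in the time domain requires comparing samples across the entire signal, whereas in the frequency domain it reduces to inspecting one bin.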


5 Towards General and Robust Recognition Systems

5.1  Versatile  recognition  systems 

A versatile recognition system has to address the issues of handling a broad range of tasks
and integrating multiple visual cues. This requires that the system employ a mostly
view-independent and task-domain-independent modeling process and be robust to occlusions
and shadows in scenes and noise in measurements. To deal with complex scenes the
system should use whatever cues are available, such as edge, color, shading, texture, motion,
and stereo, to name a few, to accomplish the task.

We  will  discuss  the  steps  towards  a  general  and  robust  recognition  machine,  based  on 
some of the machinery developed so far.

5.2 Better modeling and representations

5.2.1  Modeling  and  representing  realistic  scenes 

Many systems do not generalize to deal with occlusions, poor lighting, and shadows. Some
are susceptible to noise in measurements and variations in sizes of objects. ACRONYM
uses volumetric primitives, the generalized cones, to represent objects. RAF and Ikeuchi
et al. approximate the world with polyhedra. Ikeuchi et al. also use feature reliability
to account for the crudeness of the approximation.

To accomplish many difficult tasks, objects and sensors in general should be modeled
and represented explicitly to reveal their intrinsic properties and expose their constraints
both qualitatively and quantitatively. Presently in many applications, models are usually
specified in terms of geometric properties only, partially because of the tasks to
accomplish and the availability of sensed information. The study of image formation may lead to
discoveries of other constraints on objects. Horn [48] has investigated the problem of
recovering shape from shading using image brightness cues. Ikeuchi et al. have taken an
important step towards sensor modeling [62]. More recently, fractal representation for
natural forms has been suggested [75]. Much research remains to be done.

Most systems assume objects to be rigid. However, many real-world objects, like
moving bodies of animals, are nonrigid. Techniques for recognition of nonrigid objects
need to be developed. Grimson [35] and Ullman [83] have attempted the problem.

The use of class/subclass relations in ACRONYM is an interesting approach to handle
generic classification. ACRONYM does not commit itself to a particular
classification unless it has enough information to do so. Some tasks, such as recognizing a car as a
generic object consisting of a frame mounted on four wheels, as opposed to a specific car
model, like a Toyota Corolla, require the recognition systems to have some knowledge of
generic classes. However, for objects that do not have stable and unique decompositions,
more work should be done to discover what constitutes their class/subclass relations.


5.2.2  Modeling  errors 

A robust system must realistically model measurement errors in sensory data. RAF uses
an “error ball” for uncertainty in distance measurements and an “error cone” in angle
measurements. This type of error model imposes uniform weighting across the error
region. Ikeuchi et al. use a more realistic error model. In general, an error model should
use some sort of distribution that reflects the underlying models of objects and sensors.
One possibility is the Gaussian distribution [17]. Some statistical methods might be useful
here. Nonuniform error modeling can be expected to improve the accuracy and reduce
the need for search.
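
The contrast between uniform and nonuniform error models can be illustrated with a small sketch. Both weighting functions below are assumptions made for the example: the first mimics an error ball by accepting any residual inside a fixed radius, and the second is an unnormalized Gaussian weight. Neither is claimed to be the model used by RAF or by Ikeuchi et al., and the radius and standard deviation are arbitrary.

```python
import math


def uniform_weight(residual, radius):
    """Error-ball style model: every residual inside the ball counts equally."""
    return 1.0 if abs(residual) <= radius else 0.0


def gaussian_weight(residual, sigma):
    """Unnormalized Gaussian weight: support decays smoothly with the residual."""
    return math.exp(-0.5 * (residual / sigma) ** 2)


residuals = [0.1, 0.9, 1.1]  # hypothetical measurement errors
uniform = [uniform_weight(r, radius=1.0) for r in residuals]
gaussian = [gaussian_weight(r, sigma=0.5) for r in residuals]

for r, u, g in zip(residuals, uniform, gaussian):
    print(f"residual={r:.1f}  uniform={u:.0f}  gaussian={g:.3f}")
```

The uniform model cannot distinguish a nearly exact match from one at the edge of the ball, whereas the Gaussian weight ranks them, which is one way nonuniform modeling could help order hypotheses and reduce search.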

Figure 7: The use of model knowledge for hypothesize-and-verify in RAF and the Alignment
Method.

5.3  Further  reduction  on  search 

5.3.1  Use  of  model  knowledge 

As we noted previously, recognition can be viewed as solving two problems [31]: what is
in the scene and where the object is. RAF solves what by hypothesizing the identity of
an observed object and then determines where by verifying the hypothesis. At the step of
what, only local constraints of the model are used. The verification (model test) is done after
what is solved. The alignment method solves where before what. In this sense the alignment
method uses where (performing the model test right at the beginning) to constrain the
search for what.

The extent to which knowledge of models is used in generating the hypotheses,
hypothesizing the identity of an image object, or predicting its 3D position and orientation
as well, depends on the cost of computing a transformation relative to that of exhaustive
matching of features. Figure 7 compares RAF with the Alignment Method. It shows at
which level the matches occur, and how much knowledge of the models is hypothesized.


5.3.2  Intermediate  representations  and  search  focus 

In a cluttered world of complicated objects, recognition soon becomes computationally
intractable. The search space for all the matches to be considered can be enormous. The
size of search can be reduced in two directions:

1. using intermediate representations (e.g. grouping of edges);

2. focusing the attention of the search.

If we view recognition as a mapping from the sensed data to labeled objects, then
the search for the correct matches amounts to developing a search tree (the interpretation
tree of [33]). It is too expensive to conduct a brute-force matching. Intermediate representations
are needed to bridge the semantic distance between the input and the output of the
recognition system.

Dense  features  can  be  grouped  into  extended  features  according  to  some  similarity 
measure  (for  example  [64]).  The  matches  are  then  carried  out  on  the  intermediate  repre¬ 
sentation,  the  extended  features.  The  grouping  of  features  cuts  the  original  search  tree 
into  a  set  of  shallower  trees.  The  problem  here  is  to  choose  the  criteria  for  grouping. 
The  grouping  techniques  of  [65,  64]  are  successful  attempts  in  this  direction.  It  is  shown 
that  the  initial  grouping  of  features  reduces  the  size  of  search;  more  specifically,  it  cuts 
down  both  the  height  and  the  branching  factor  of  the  interpretation  tree,  and  increases 
the  robustness  of  the  system  to  small  variations  in  object  and  image  parameters.  Yip  [88] 
used  intermediate  representations  of  the  phase  space  of  dynamical  systems  to  bridge  the 
input  with  the  output  in  the  area  of  automatically  recognizing  Hamiltonian  systems. 
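
The effect of grouping on the size of the search can be illustrated with simple arithmetic. The sketch below assumes an idealized interpretation tree with one level per image feature and one branch per model feature, and assumes that grouping merges a fixed number of features into each extended feature on both the image and the model side. These are simplifying assumptions for illustration only; the function names are hypothetical and the numbers are not taken from [65, 64].

```python
def tree_size(num_data_features, num_model_features):
    """Upper bound on the number of leaves of an idealized interpretation
    tree: height = number of data features, branching = number of model
    features."""
    return num_model_features ** num_data_features


def grouped_tree_size(num_data_features, num_model_features, group_size):
    """Leaf bound after grouping, assuming every `group_size` features are
    merged into one extended feature in both the image and the model."""
    height = -(-num_data_features // group_size)      # ceiling division
    branching = -(-num_model_features // group_size)  # ceiling division
    return tree_size(height, branching)


print(tree_size(12, 12))             # 12**12 = 8916100448256
print(grouped_tree_size(12, 12, 4))  # 3**3 = 27
```

Under these assumptions, grouping reduces both the exponent (height) and the base (branching factor) of the bound, which is why even modest grouping shrinks the search so sharply.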

Alternatively,  the  original  set  of  dense  features  can  be  subdivided  into  several  subsets.
Each  of  the  subsets  is  then  matched  to  the  model.  The  first  successful  match  constrains
quantitatively  the  matches  on  the  rest  of  the  subsets,  which  cuts  down  the  branching  factor
of  the  search  tree  from  then  on.  Which  subset  to  try  first  affects  the  performance.  Finding  distinctive
features  [8],  such  as  locations  of  holes  and  corners,  and  matching  them  against  the  model
falls  somewhat  into  the  scheme  of  feature  subdivision.  The  alignment  method  is  another
instance  of  the  subdivision  [83].  The  difficulty  is  to  decide  what  constitutes  a  distinctive
feature.  More  work  should  be  done  on  how  to  find  and  extract  distinctive  features  reliably.
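
The idea that a first successful match on a small subset constrains the remaining matches can be sketched as follows. This is a deliberately simplified illustration, not the alignment method of [83]: it assumes 2D point features, a pose consisting of a translation only, and a list of distinctive model features that is given in advance. The names `match_with_focus`, `distinctive`, and `tol` are hypothetical.

```python
def match_with_focus(image_pts, model_pts, distinctive, tol=0.5):
    """Hypothesize a translation from one distinctive model feature, then
    verify it against all remaining model features.

    Returns the first translation (tx, ty) under which every model point
    lands within tol of some image point, or None if there is none.
    """
    for mi in distinctive:
        mx0, my0 = model_pts[mi]
        for ix0, iy0 in image_pts:
            # Hypothesis: the distinctive model feature maps to this image point.
            tx, ty = ix0 - mx0, iy0 - my0
            # Verification: the hypothesis fixes where every other model
            # feature must appear, so no further branching is needed.
            verified = all(
                any(abs(mx + tx - ix) <= tol and abs(my + ty - iy) <= tol
                    for ix, iy in image_pts)
                for mx, my in model_pts
            )
            if verified:
                return (tx, ty)
    return None


model = [(0, 0), (4, 0), (0, 3)]
# The model translated by (10, 5), plus two clutter points.
image = [(1, 1), (10, 5), (14, 5), (10, 8), (20, 20)]
print(match_with_focus(image, model, distinctive=[0]))  # (10, 5)
```

In this sketch the cost is proportional to the number of distinctive features times the number of image features, rather than exponential in the number of features, at the price of depending entirely on the choice of distinctive features.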

Putting  the  most  effective  constraints  at  the  beginning  of  the  search  essentially  constrains
the  branching  of  the  search  tree.  How  to  rank  constraints  by  their  effectiveness,
especially  when  they  come  from  different  sources,  remains  an  open  research  problem.  The 
answer  will  be  some  metric  of  quality  for  constraints. 


5.3.3  Remark 

Since  we  have  a  fairly  good  understanding  of  the  effectiveness  of  simple  geometric  constraints,
more  work  should  be  directed  towards  (1)  the  discovery  of  better  constraints;  (2)
the  integration  of  constraints  from  multiple  visual  cues;  and  (3)  the  study  of  their  combined
effects.  To  achieve  performance  comparable  with  that  of  human  beings,  the  key  is  to  use  a  library  of
the  techniques  we  have  developed  and  to  invoke  each  of  them  in  the  appropriate  situation  at  the
appropriate  time.


6  Conclusion


The  principal  issues  in  machine  recognition  —  generality,  representation,  and  control  — 
have  been  identified  and  their  principles  and  requirements  have  been  motivated.  We 
have  presented  a  comparative  framework  for  the  evaluation  of  different  approaches  to 
machine  recognition,  particularly  those  of  ACRONYM,  RAF,  and  Ikeuchi  et  al.,  and  have 
discussed  and  compared  these  approaches  with  respect  to  the  three  issues:  generality, 
representation,  and  control.  These  different  approaches  to  machine  recognition  constitute 
an  important  class  of  recognition  methods,  namely  model-based  recognition,  which  has  been
successful  compared  with  its  counterparts.  This  paradigm  formulates  recognition  as  the 
correspondence  between  the  image  objects  and  model  objects  and  searches  for  matches 
that  give  the  correct  interpretations  to  the  image. 

Various  tradeoffs  made  in  the  implementations,  such  as  efficiency  vs.  generality,  ef¬ 
ficiency  vs.  accuracy,  and  efficiency  vs.  runtime  flexibility,  have  been  identified  and 
analyzed  with  respect  to  the  systems’  generality  and  intended  performance.  We  conclude 
that  the  decisions  underlying  these  tradeoffs  are  mainly  task-driven. 

We  have  formulated  the  principles  and  requirements  of  representation  and  control  for 
machine  recognition.  The  issues  of  usefulness,  computability,  invariance,  and  robustness 
of  representations  have  been  addressed  in  the  context  of  ACRONYM,  RAF,  and  Ikeuchi 
et  al.,  along  with  other  work.  We  have  discussed  the  common  control  strategy  used  in 
these  approaches:  hypothesize-and-verify.  The  strategy  hypothesizes  a  match  between 
the  image  object  and  the  model  and  verifies  that  the  match  actually  produces  a  legal 
interpretation  of  the  image.  We  have  pointed  out  the  strengths  and  weaknesses  of  each 
approach  with  respect  to  its  modeling  of  objects,  its  robustness  to  errors,  its  control  of 
search,  and  the  extent  of  model  knowledge  used  in  constraining  the  search  for  consistent 
matches,  and  suggested  future  improvements  for  each  approach. 

We  have  discussed  the  steps  towards  general  and  robust  recognition  systems.  Based 
on  our  analysis  of  the  use  of  feature  grouping  and  search  focus  techniques  in  reducing
the  size  of  search,  we  have  suggested  intermediate  representations  to  bridge  the  gap
between  the  object  models  and  the  images.  We  argue  that  more  realistic  modeling  and
representation  of  objects  and  errors  and  more  efficient  control  strategies  are  the  key  to
versatile  recognition  systems.